Integration of Externally Produced Metadata into Multimedia File Search-engine
CESNET
technical report number 31/2007
Michal Krsek
December 7, 2007
1 Abstract
The previous version of our video search-engine worked only with metadata inserted directly into the multimedia files (such as the clip title or the author's name). This approach ensures good metadata relevance, since it is the author who puts the data into the file. On the other hand, relying on "internal" metadata alone limits the amount of information available for a particular multimedia entry. What is mostly missing are metadata entered by users and data that originate during further processing of a record (transcoding, web publishing). For this reason, we integrated two ways of importing externally produced metadata into the system. The two implementations reflect different levels of trust in the metadata providers.
Keywords: search, video, audio, Internet, metadata, podcast, media RSS
2 The Search-engine Architecture
The architecture of the search-engine [Dol03] is distributed. Individual subsystems communicate in batches using the SCP and HTTP protocols; the data format is partly plain text and partly XML.
2.1 Subsystem Functions
2.1.1 Crawlers
The crawlers gather the addresses (URLs) of multimedia files that are accessible on the Internet. Functionally, they are comparable to the crawlers used by fulltext search-engines (such as Google). The difference lies in how the results are used. While the crawlers of fulltext search-engines use all the data gathered on the visited pages, our crawlers pass on only the addresses (for further processing, the list of addresses is exported as a text file with one address per line).
The address of a multimedia file suitable for further processing is identified by its suffix and, where applicable, by the Content-Type header of the web server response.
Technically, the crawlers consist of several PC servers (the number is growing) running the Debian Linux operating system. The application itself is a proprietary system developed by our research partner, the Jyxo Company.
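The selection rule and the one-address-per-line export can be sketched roughly as follows. This is only an illustration in Python, not the Jyxo crawler code: the suffix list, the accepted content-type prefixes, and the function names are assumptions, since the report does not enumerate them.

```python
from urllib.parse import urlparse
import posixpath

# Hypothetical lists; the report does not say which suffixes or types are accepted.
MEDIA_SUFFIXES = {".avi", ".mpg", ".mp3", ".mov", ".wmv", ".rm"}
MEDIA_TYPE_PREFIXES = ("audio/", "video/")


def is_media_candidate(url, content_type=None):
    """Decide by suffix first, then by the Content-Type header if one is known."""
    path = urlparse(url).path  # ignores the query string and fragment
    suffix = posixpath.splitext(path)[1].lower()
    if suffix in MEDIA_SUFFIXES:
        return True
    if content_type:
        # Drop parameters such as "; charset=..." before comparing.
        main_type = content_type.split(";", 1)[0].strip().lower()
        return main_type.startswith(MEDIA_TYPE_PREFIXES)
    return False


def export_addresses(urls):
    """Serialize the address list as plain text, one address per line."""
    return "".join(url + "\n" for url in urls)
```

For example, `is_media_candidate("http://example.org/stream", "video/mpeg")` accepts an address that has no telling suffix.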
2.1.2 Work Database
The work database forms a data platform for the other systems (the distillers and the fulltext database). The distillers pick up the addresses to be processed from the work database and return the distilled metadata (and thumbnails). Data in the work database also serve statistical purposes and the elimination of duplicates.
Technically, it is a high-performance PC server running the Windows Server operating system and the MS SQL Server database. This subsystem could be replaced by any SQL database with sufficient performance and transaction support.
The way data are stored in the work database determines the priority of metadata saved in the files over metadata saved in Media RSS and podcasts. Metadata distilled directly from the multimedia files are saved in one table as triplets: reference (to the table of URLs), name, value. Externally produced metadata are saved in a different table as quadruples: reference (to the table of podcasts), reference (to the table of URLs), name, value.
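The triplet and quadruple layout could look roughly like the schema below. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MS SQL Server; all table and column names are assumptions, as the report does not give the real schema.

```python
import sqlite3

# Hypothetical schema; names are illustrative only.
SCHEMA = """
CREATE TABLE url (
    id      INTEGER PRIMARY KEY,
    address TEXT UNIQUE NOT NULL      -- UNIQUE is one simple way to reject duplicates
);
CREATE TABLE podcast (
    id           INTEGER PRIMARY KEY,
    feed_address TEXT UNIQUE NOT NULL
);
-- Metadata distilled from the file itself: (url reference, name, value)
CREATE TABLE file_metadata (
    url_id INTEGER NOT NULL REFERENCES url(id),
    name   TEXT NOT NULL,
    value  TEXT NOT NULL
);
-- Externally produced metadata: (podcast reference, url reference, name, value)
CREATE TABLE feed_metadata (
    podcast_id INTEGER NOT NULL REFERENCES podcast(id),
    url_id     INTEGER NOT NULL REFERENCES url(id),
    name       TEXT NOT NULL,
    value      TEXT NOT NULL
);
"""


def create_work_database():
    """Create an empty in-memory database with the sketched tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

Keeping the two kinds of metadata in separate tables means the export step can always tell which source a value came from.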
2.1.3 Distillers
The distillers harvest metadata and thumbnails from multimedia files. A typical distiller transaction looks as follows:
- Picking up the address from the work database.
- Passing the address to the OLE object of the appropriate multimedia player.
- Getting the data from the OLE object of the appropriate multimedia player.
- Saving the metadata into the work database.
- Passing the address to the modified mplayer and saving the thumbnail.
- Converting the thumbnail and saving it into the work database.
Several distillers operate at the same time [Hav07]. The exact number depends on the occupancy rate of the client computers in the labs of cooperating universities and varies from several instances to more than one hundred. Each distiller is a virtual machine on which Windows XP and multimedia players (Real Player, Windows Media Player, and QuickTime Player) are installed. The application logic is written in C# for the .NET 2.0 platform. A virtual machine was chosen for easy portability among different platforms and for security reasons (separation from the host operating system).
The distillers are not synchronized with each other; their coordination (making sure that no two distillers work on the same file) takes place through transactions in the work database.
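One way such transaction-based coordination can work is sketched below: each distiller claims the next unclaimed address inside a single transaction. This is an assumption-laden illustration using Python's sqlite3 in place of MS SQL Server; the `claimed_by` column and the function names are hypothetical and do not come from the report.

```python
import sqlite3


def create_queue(addresses):
    """Build a small in-memory queue of addresses (stand-in for the work database)."""
    # isolation_level=None: the module does not open transactions implicitly,
    # so the explicit BEGIN/COMMIT statements below are the only ones in play.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE url (id INTEGER PRIMARY KEY, address TEXT UNIQUE, claimed_by TEXT)"
    )
    conn.executemany("INSERT INTO url (address) VALUES (?)", [(a,) for a in addresses])
    return conn


def claim_next(conn, distiller_id):
    """Atomically pick one unclaimed address and mark it as taken."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, address FROM url WHERE claimed_by IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE url SET claimed_by = ? WHERE id = ?", (distiller_id, row[0]))
        conn.execute("COMMIT")
        return row[1] if row is not None else None
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the read and the update happen under one write lock, two distillers calling `claim_next` cannot receive the same address, and no direct communication between them is needed.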
2.1.4 Picture Database
The picture database is a replica of the work database that contains only the thumbnails of the multimedia files. These pictures are then linked from the user interface of the search-engine. The reason for this replication is that the load on the systems varies (typically depending on the number of distillers in operation). Separating the databases lets us provide users with the same quality of service regardless of the load the distillers place on the work database. At the same time, since end users only read data, we can scale this part of the system quite easily, independently of the work database performance.
The picture database is a PC server with the Windows Server 2003 operating system, MS SQL Server, and IIS.
2.1.5 Fulltext Database
The data are transmitted from the work database into the fulltext database in batches, through the SCP protocol and in the XML format. Because of the large amount of data (tens of GB) and the nature of the export (data transformations are performed at the output), the export from the work database takes tens of hours. This is why the process is carried out in batches, approximately once a week. When needed, we export only the data modified since the previous export (according to the timestamp of the distiller's last change to the entry).
The fulltext database runs on a PC server with the Debian Linux operating system. It is a component developed by our research partner, the Jyxo Company.
The web servers and the fulltext database communicate through the HTTP protocol, and the outputs that the database provides are in the XML format. It is therefore easy for web server operators to present the search-engine output in whatever interface they require.
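The incremental variant of the export can be illustrated with the short Python sketch below. The element names (`export`, `entry`, `meta`), the record layout, and the function name are assumptions made for the example; the report does not describe the actual XML format of the batch.

```python
import xml.etree.ElementTree as ET


def export_batch(entries, last_export=None):
    """Serialize entries to XML; with last_export set, keep only newer entries.

    Each entry is a dict with the keys "url", "modified" (a datetime) and
    "metadata" (a list of (name, value) pairs). This layout is hypothetical.
    """
    root = ET.Element("export")
    for entry in entries:
        # Skip entries the distiller has not touched since the previous export.
        if last_export is not None and entry["modified"] <= last_export:
            continue
        node = ET.SubElement(root, "entry", url=entry["url"])
        for name, value in entry["metadata"]:
            ET.SubElement(node, "meta", name=name).text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `export_batch(entries)` without a timestamp corresponds to the full weekly export; passing the time of the previous export yields only the changed entries.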
2.1.6 Web Servers
The web servers are the only component of the search-engine that the end user communicates with. A web server that queries the search-engine can be operated on any platform that supports XML transformations.
Typically, the web server is operated by a third party that uses our search-engine.
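As an illustration of what a third-party web server might do with the XML output, the following Python sketch turns a response into an HTML list. The response structure (`results`, `hit`, `meta`) is invented for the example, because the report does not document the real output format, and the HTTP request itself is left out.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical response; the real format of the fulltext database output is not given.
SAMPLE_RESPONSE = """<results query="lecture">
  <hit url="http://example.org/media/lecture.avi">
    <meta name="title">Intro lecture</meta>
    <meta name="author">A. Author</meta>
  </hit>
</results>"""


def render_results(xml_text):
    """Transform the XML search output into a simple HTML list of links."""
    root = ET.fromstring(xml_text)
    items = []
    for hit in root.findall("hit"):
        fields = {m.get("name"): (m.text or "") for m in hit.findall("meta")}
        url = hit.get("url", "")
        title = fields.get("title") or url  # fall back to the address
        items.append('<li><a href="%s">%s</a></li>' % (escape(url, quote=True), escape(title)))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

An operator who prefers XSLT or another templating mechanism could apply the same idea; only the ability to process XML is required.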
3 Ways of Inserting Externally Produced Data
Over the past year, we prepared two ways of inserting externally produced data. The two methods differ architecturally in their point of entry, and which one is applied depends on how much the search-engine operator trusts the metadata provider.
3.1 Inserting through Media RSS or Podcast Feeds
Inserting through Media RSS or podcast feeds is designed for integrating metadata from providers with whom the operators have no relationship. It enables partially or fully automatic import of metadata from many sources.
Media RSS and podcasts are special XML-based formats that automate the announcement of new content from a single provider. Apart from the metadata, they also contain, for example, the address of the relevant source or the refresh frequency. Both formats can also be used as aggregators of multiple sources.
There are two ways in which the search-engine operator can learn that a particular source exists. The first is that the crawlers acquire its address while browsing the web, since feed addresses are published on web pages. The second is that the metadata provider sends the address to the search-engine operator (e.g. by e-mail).
The feeds are processed by a new component of the system, the feed processor.
3.1.1 Feed Processor
The feed processor browses through Media RSS feeds and saves the addresses of multimedia entries, together with their metadata, into a data structure in the work database. The architecture of this data structure is similar to the one used for metadata gathered by the distillers; however, the two are kept as separate data stores.
The feed processor works as follows:
- Take the address of the feed.
- Download the content of the feed (appropriate XML).
- For each entry that is not yet in the database:
  - save the address of the entry and the metadata into the feed part of the work database;
  - save the address of the entry into the original part of the work database.
- Go to the next feed scheduled for renewal.
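The steps above can be sketched in Python as follows. This is a simplified illustration, not the actual feed processor: the download step is omitted (the feed XML is passed in as text), plain Python collections stand in for the two parts of the work database, and the choice of which item fields to keep as metadata is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace used by Media RSS elements such as <media:content>.
MEDIA_NS = "{http://search.yahoo.com/mrss/}"


def parse_feed(xml_text):
    """Return a list of (entry_address, [(name, value), ...]) from an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        media = item.find(MEDIA_NS + "content")
        enclosure = item.find("enclosure")  # used by podcast feeds
        if media is not None and media.get("url"):
            address = media.get("url")
        elif enclosure is not None and enclosure.get("url"):
            address = enclosure.get("url")
        else:
            continue  # the item does not point to a multimedia file
        metadata = []
        for tag in ("title", "description", "author"):  # illustrative selection
            text = item.findtext(tag)
            if text and text.strip():
                metadata.append((tag, text.strip()))
        entries.append((address, metadata))
    return entries


def process_feed(feed_address, xml_text, known_addresses, feed_table, url_queue):
    """Store new entries in the feed part and queue their addresses for the distillers."""
    for address, metadata in parse_feed(xml_text):
        if address in known_addresses:
            continue  # the entry is already in the database
        known_addresses.add(address)
        for name, value in metadata:
            feed_table.append((feed_address, address, name, value))
        url_queue.append(address)
```

Running `process_feed` twice on the same feed adds nothing the second time, which corresponds to the "not in the database" condition in the list.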
Because the entry address is saved to the original part of the work database, the distillers will start working with it. They will find out whether the entry is accessible and whether it contains further metadata. The address will appear in the search results only after it has passed successfully through a distiller. In this way, we avoid entering metadata for multimedia files that are not available.
The transfer between the work database and the fulltext database takes into account both metadata stored directly in a file and metadata entered by the feed processor. The logic of the export process ensures the priority of the metadata saved directly in the file. Technical metadata (such as the resolution, codec or format), whose values can contradict each other, are taken preferentially from the table of metadata distilled by the distillers. If they do not exist in this table, their first occurrence in the table filled by the feed processor is exported.
From the viewpoint of the search-engine logic, other metadata cannot contradict each other and are thus all exported. Where metadata harvested directly by the distillers exist, they are exported in the structure as named (the user can see them in the search results). Other metadata are exported as unnamed; the search-engine can search them, but only named metadata are shown to the user in the output.
When there are no metadata gathered by the distillers, the first occurrence in the table filled by the feed processor is saved into the named data.
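The priority rules of the export can be summarized in a short Python sketch. It reflects one reading of the rules, namely that the fallback to the feed table is applied per metadata name; the set of technical names and the function name are assumptions for the example.

```python
# Hypothetical set of "technical" names whose values may contradict each other.
TECHNICAL = {"resolution", "codec", "format"}


def merge_for_export(distilled, feed):
    """Combine metadata of one entry for export.

    distilled: (name, value) pairs harvested by the distillers
    feed:      (name, value) pairs stored by the feed processor, in table order
    Returns (named, unnamed): named values are shown to the user,
    unnamed values are searchable only.
    """
    named = {}
    unnamed = []
    for name, value in distilled:
        named.setdefault(name, value)  # metadata from the file has priority
    for name, value in feed:
        if name not in named:
            named[name] = value        # first feed occurrence fills the gap
        elif name not in TECHNICAL:
            unnamed.append(value)      # exported, but only for searching
        # Conflicting technical values from the feed are dropped.
    return named, unnamed
```

For instance, a title distilled from the file stays the displayed title, while a different title from a podcast feed remains searchable as unnamed text.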
3.2 Inserting into Fulltext Database
The second, functionally simpler method of importing externally produced metadata into the search-engine is to insert the metadata at the input interface of the fulltext database. Since the transfer of data between the work database and the fulltext database takes place in batches and through files in the XML format, it is possible to append a particular metadata file to the export batch.
The fulltext database treats the content of the appended metadata in the same way it treats the work database output. The only check performed concerns the integrity and validity of the XML file. That is why we use this method only for trustworthy metadata providers who deliver credible metadata of high quality.
No specific XML format is prescribed; thanks to XML transformations, we accept almost any valid XML document whose attributes correspond to the Dublin Core metadata scheme.
The metadata (XML files) are currently transferred by electronic mail; however, any automated mechanism could be used.
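A rough Python sketch of accepting a loosely structured document follows. It checks only that the XML is well-formed (a weaker test than schema validity) and collects every element whose local name matches one of the fifteen Dublin Core elements, regardless of nesting or namespace. The function name and this matching strategy are assumptions; the transformations actually used are not described in the report.

```python
import xml.etree.ElementTree as ET

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def extract_dublin_core(xml_text):
    """Return [(name, value), ...] for Dublin Core elements, or None if not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # reject documents that fail the integrity check
    found = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue
        # Strip a "{namespace}" prefix, if any, and compare the local name.
        local = element.tag.rsplit("}", 1)[-1].lower()
        if local in DC_ELEMENTS and element.text and element.text.strip():
            found.append((local, element.text.strip()))
    return found
```

A document that yields an empty list is well-formed but carries nothing the import could use, so an operator might reject it as well.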
4 Conclusion
The system was put into routine operation at the end of October 2007. Currently, at the end of 2007, we are working with several tens of podcasts and are feeding the search-engine with selected content from the CESNET videoserver.
During test operation, we found that podcasts represent an interesting source of links for the search-engine. Authors who publish their works through podcasts ignore the possibility of inserting metadata into the multimedia files. This leads to a situation in which the respective files, when linked outside of the podcast, cannot be found by users.
We expect the system to evolve in two areas. In the area of the user interface, we are preparing web pages that will describe the search-engine functionality and will possibly allow users to experiment with its output. We expect to implement this in 2008.
In the area of further improving search-engine quality, we are preparing tests of direct search in video data. We expect to publish a technical report with the test results in 2008 and 2009.
References
| [Dol03] | Doležal I., Illich M., Krsek M.: Internet search in multimedia data. CESNET technical report 19/2003, Praha: CESNET, 2003. |
| [Hav07] | Havlíček J., Krsek M.: Operation of a Distiller System in a Distributed Environment - The DistillerGRID. CESNET technical report 1/2007, Praha: CESNET, 2007. |