Extracting Additional Information from Lecture Recordings
CESNET
technical report number 11/2006
Stanislav Sumec
28.11.2006
1 Abstract
This technical report proposes a system that enables synchronization of lecture recordings with slide presentations, extraction of individual slides, and, eventually, analysis of their content. The methods are based on on-line or off-line processing of video captured by a high-definition camera. As such, the system offers a suitable complement to traditional approaches, which require non-trivial preparation steps on the presenter's side. The basic building blocks of the system and the underlying technical decisions are described.
Keywords: Lecture recordings, slide detection, transition effect, content extraction.
2 Introduction
Lecture recordings play a significant role in supporting distance education. Existing technologies make it possible to record both audio and video with sufficient quality. It is also possible to store all the recordings in media content repositories. Therefore, the crucial task is to provide easy access to the stored material. That involves not only direct playback of the recordings, but also their efficient indexing and searching, summarization, content analysis and other advanced features.
Many lecturers use a data projector for their presentations. Technically speaking, presentations are sequences of static frames (often extended by transition effects, animations, etc.) and/or video sequences. Specialized applications such as Microsoft PowerPoint or OpenOffice.org Impress make it possible to design presentations intuitively. Many materials are also presented in the PDF format. In addition to their primary purpose, the data contained in a presentation can be used for further processing of the recordings, e.g., video editing, indexing, etc. This direction forms one of the main motivations of the presented work.
There are several ways to obtain the presentation itself and the data necessary for its synchronization with the lecture recording. Usually a lecturer can provide the source presentation for further processing. This data should be combined with the material in non-electronic forms captured by the camera. A challenging task is to synchronize individual slides with the lecturer's talk. Specialized tools implement this functionality in particular cases. For example, Microsoft Producer is able to record a lecture together with a presentation created in PowerPoint. Thus, the presenter's actions, such as switching PowerPoint slides, can be easily recorded. However, there is no universal solution bridging all presentation formats. Moreover, all such tools require either installation on the presenter's own computer or the strict use of a single presentation computer for all presentations.
Another possibility is to record a video of the projected data synchronously during the lecture. Several data sources can be used for this purpose, e.g., the signal sent to the projector, a separate camera focused on the projection screen, or a camera capturing both the lecturer and the projection screen. This paper deals with the last case, which combines the advantages of a single source (no need to record various data sources) with the ability to capture the lecturer's actions, such as pointing and gestures. Other advantages of this method are as follows:
- it does not depend on the format of slides,
- no extra time is required during the lecture preparation.
A limiting factor of this technique is the recording quality: the lecturer and the projection screen have to be recorded at a sufficient resolution. Fortunately, high-resolution cameras (1440x1080 pixels), such as the Sony HDR-FX1, have become available in recent years. Figure shows an example of the source data.
3 System architecture
The task of extracting additional graphical information from video can be split into several independent subtasks (the basic schema is presented in Figure). The input of the system is a video stream recorded with a high-resolution camera that captures both the lecturer and the projection screen. The system produces the results mentioned above (synchronization, content extraction, etc.).
The area of the projection screen has to be identified in the source picture first. The presentation and any other information are projected into this region. The video stream of the presentation can then be cut out of the original video stream. However, it is necessary to correct the perspective distortion caused by the relative positions of the camera and the screen. This first step can be skipped if an independent camera is used for recording the projection screen.
A key part of the system is the block that provides slide-swap detection. This component produces a timeline describing when each particular slide was shown. This data can be useful for compressing the presentation video stream, because only the intervals with slide transitions need to be stored. However, the main benefit is the synchronization of the presentation with the rest of the lecture. It is especially useful for indexing, which makes it possible to find the part of the lecture corresponding to a particular slide. During off-line viewing, the slides can be displayed in sync with the playback of the recorded lecture. Further applications of the synchronization data can be found in the post-processing of recordings, such as speech recognition or video summarization.
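The use of such a timeline for indexing can be illustrated with a minimal sketch. It assumes the timeline is a sorted list of (start time in seconds, slide identifier) pairs; this representation and the function names `slide_at` and `intervals_for` are illustrative assumptions, not part of the described system.

```python
from bisect import bisect_right

def slide_at(timeline, t):
    """Return the slide shown at time t, or None before the first slide.

    timeline: list of (start_seconds, slide_id), sorted by start time
    (an assumed representation).
    """
    starts = [start for start, _ in timeline]
    i = bisect_right(starts, t) - 1
    return timeline[i][1] if i >= 0 else None

def intervals_for(timeline, slide_id, end_time):
    """Return all (start, end) intervals during which slide_id was shown."""
    result = []
    for i, (start, sid) in enumerate(timeline):
        if sid == slide_id:
            # A slide is shown until the next swap, or until the recording ends.
            end = timeline[i + 1][0] if i + 1 < len(timeline) else end_time
            result.append((start, end))
    return result

# Hypothetical timeline: slide 2 is shown twice (the lecturer returned to it).
timeline = [(0.0, 1), (65.0, 2), (140.5, 3), (200.0, 2)]
print(slide_at(timeline, 70.0))
print(intervals_for(timeline, 2, end_time=300.0))
```

The first lookup supports displaying the right slide during playback, the second supports jumping to the parts of the lecture that correspond to a chosen slide.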
The slide-swap detection block should be able to process various kinds of presentations. Presentations consisting of static frames are the most common. Presentations with various transition effects, which make them more attractive to the audience, are also very popular nowadays. Besides static frames and transition effects, presentations may contain animations and video sequences. Such parts of the presentation have to be recognized and stored in this phase as well.
The next processing step is the recognition of slide content. The expected result is text or other data describing the objects contained in the particular slides. Such information can also be used for indexing and searching.
Finally, the activity of the lecturer can be analyzed. For example, the lecturer may modify the content of some slides. In addition, gesticulation, pointing at selected parts of a slide (text, images, etc.), and other activities can be detected.
4 Processing description
The position of the projection screen can be identified using edge detection algorithms [Hea96]. For example, the Canny detector [Can86] was used in an experimental implementation of the proposed system. It is also possible to apply other image-processing methods here, e.g., the Hough transform [BB82]. The result of such algorithms is a set of lines corresponding to the detected edges. The final detection can make use of the expected shape and dimensions of the projection plane. In addition, it can be assumed that the position of this area remains constant throughout the whole recording. If the configuration of the auditorium is stable, the position of the projection area can be preset manually. However, automatic detection is more suitable for eliminating minor deviations of the position, which are typical for flexible rooms. An example of the edge detection and the ideal localization of the projection screen is presented in Figure.
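The step from edge pixels to lines can be sketched with a very small Hough transform. The code below is a simplified illustration rather than the experimental implementation: it assumes the edge detector has already produced a set of edge pixel coordinates, and the function name `hough_lines` and the parameters `n_theta` and `top` are illustrative choices.

```python
import math

def hough_lines(edge_points, n_theta=36, top=4):
    """Vote for lines x*cos(theta) + y*sin(theta) = rho through edge pixels.

    Returns the `top` strongest lines as (rho, theta_degrees, votes).
    n_theta (angular resolution) and top are assumed, illustrative values.
    """
    votes = {}
    for x, y in edge_points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, t)] = votes.get((rho, t), 0) + 1
    strongest = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return [(rho, 180.0 * t / n_theta, n) for (rho, t), n in strongest]

# Synthetic edge map: the outline of an axis-aligned rectangular screen.
edges = {(x, y) for x in range(10, 51) for y in (5, 30)}
edges |= {(x, y) for y in range(5, 31) for x in (10, 50)}

for rho, angle, n in hough_lines(edges):
    print(f"rho={rho}, theta={angle} deg, votes={n}")
```

On this synthetic input the four strongest lines correspond to the four sides of the rectangle. A real recording would additionally require selecting, among many candidate lines, those that match the expected shape and dimensions of the projection plane.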
Correction of the perspective projection follows. Although it is possible to cut out the projection area directly, viewers appreciate the elimination of the deformation caused by the perspective. This correction is also important for further processing such as text recognition. A simple deformation of the area according to the known aspect ratio usually suffices. More advanced techniques include the assessment of the projection matrix given by the auditorium configuration [Fau93]. However, the correction can also be performed without any additional information about the configuration [SSM01]. In fact, a composed transformation, which covers the transformations from slide position to projection plane and from projection plane to camera pixel, has to be found. A few known calibration points can be used to compute this transformation. As mentioned above, it is necessary to use high-resolution cameras for the slides to be readable. The application of the perspective correction is demonstrated in Figure.
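As an illustration of how such a transformation can be computed from calibration points, the following sketch estimates a planar projective transformation (homography) from four point correspondences by solving an 8x8 linear system. It is a minimal sketch under stated assumptions: the four corners of the projection area in the camera image are taken as calibration points, the coordinates are made up, and the function names are illustrative. The experimental system uses the Intel Math Kernel Library for this computation; plain Gaussian elimination is substituted here to keep the example self-contained.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography(src, dst):
    """Estimate the 3x3 matrix H mapping four src points to four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and similarly for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def apply_h(h, x, y):
    """Map the point (x, y) through the homography h."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

# Hypothetical screen corners in the camera image and the target rectangle.
corners = [(212, 148), (1180, 102), (1215, 870), (190, 805)]
target = [(0, 0), (1024, 0), (1024, 768), (0, 768)]
H = homography(corners, target)
print(apply_h(H, 212, 148))
```

To produce the corrected image, one would typically iterate over the pixels of the target rectangle and sample the camera image through the inverse mapping, which can be obtained by calling `homography(target, corners)`.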
If a presentation contains only static frames, slide-swap detection can be performed by computing the total difference between subsequent frames. In this case, a difference higher than a preset threshold indicates a slide swap. The threshold can be defined as a constant or evaluated dynamically from the values obtained over the whole recording. However, the simple computation of differences between two subsequent frames is error-prone. The evaluation can be affected by noise in the video stream, or by the fact that a slide swap is sometimes captured across more than two subsequent frames. This can be caused, for example, by frame interlacing in the camera or by the video compression used. Therefore, it is better to mark the time point of the slide swap only once the differences between the following frames have stabilized. An example of an undesirable frame transition is demonstrated in Figure.
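The thresholding with stabilization described above can be sketched as follows. The sketch assumes that a list of frame-to-frame difference values is already available; the function name `detect_swaps`, the constant threshold and the `settle` parameter (the number of quiet frame pairs required before a swap is reported) are illustrative assumptions.

```python
def detect_swaps(diffs, threshold, settle=2):
    """Return indices of the first stable frame after each slide swap.

    diffs[i] is the difference between frame i and frame i + 1.
    A swap is reported only after `settle` consecutive differences
    fall below the threshold, so a swap smeared over several frames
    is counted once.
    """
    swaps = []
    changing = False
    quiet = 0
    for i, d in enumerate(diffs):
        if d > threshold:
            changing = True
            quiet = 0
        elif changing:
            quiet += 1
            if quiet >= settle:
                swaps.append(i - settle + 1)
                changing = False
    return swaps

# Made-up difference values: the first swap spans two frame pairs (5 and 6).
diffs = [1, 2, 1, 1, 1, 900, 400, 2, 1, 1, 1, 1, 850, 1, 2, 1]
print(detect_swaps(diffs, threshold=100))
```

With a naive per-frame threshold, the first swap in this example would be reported twice. Waiting for the differences to settle yields a single time point per swap, at the cost of a short detection delay.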
Several experiments were performed to verify this method. Figure shows the results obtained from a part of a real lecture. The sum of squared pixel differences was used as the frame-difference criterion. As can be seen from the graph, there are clear peaks which correspond to the slide swaps. Preliminary experiments showed that this method provides satisfactory results for presentations with static slides.
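The frame-difference criterion can be written down directly. The sketch below assumes grayscale frames stored as equally sized lists of pixel rows; this representation and the function names are illustrative assumptions.

```python
def frame_difference(a, b):
    """Sum of squared pixel differences between two equally sized frames."""
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))

def difference_series(frames):
    """Difference between each pair of subsequent frames."""
    return [frame_difference(f, g) for f, g in zip(frames, frames[1:])]

# Tiny made-up 2x2 grayscale frames.
frame_a = [[0, 0], [10, 10]]
frame_b = [[0, 5], [10, 20]]
print(difference_series([frame_a, frame_a, frame_b]))
```

Squaring the differences emphasizes large local changes over small noise spread across the frame, which is consistent with the clear peaks observed at the slide swaps.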
However, presentations frequently contain transition effects. These effects can be applied to slide swaps, so that one slide changes into the next gradually. Another possibility is partial animation of slides, where certain parts of a slide are displayed step by step. Typically, individual list items are shown in this way. While the first variant can be detected by the above-mentioned method with minor modifications, alternative techniques have to be found for the second case. Instead of detecting global changes, local changes with detailed information about the final state of each particular object (e.g., items of text) will be tackled. The resulting timeline will contain not only the detection of the particular slides but also their segmentation into the parts presented separately. The optical flow methods [BB95], which are used for camera motion estimation or object tracking, can be applied for this segmentation.
A method using the KLT tracker was tested in the experimental system. The algorithm selects features that are well suited for tracking and keeps track of them. The main idea of the algorithm is to combine feature extraction with the tracking of those features. Good features are texture patches with high intensity variation, e.g., corners. When the features are tracked between frames, several typical cases can occur. First, the positions of the features do not change, which means that the projected slide stays the same. Second, some features move, which signals an animation in progress. Third, features are lost, which usually occurs when slides are exchanged. Finally, new features are detected when an animation in a slide starts. The overall detection of slides and animations can be derived from a statistical evaluation of all these cases.
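The four cases can be turned into a simple decision rule. The sketch below does not implement the KLT tracker itself; it assumes the tracker output is available as a mapping from feature identifiers to positions for two subsequent frames. The function name `classify_transition`, the tolerance `move_tol` and the ratio `lost_ratio` are hypothetical and would have to be tuned on real recordings.

```python
import math

def classify_transition(prev, curr, move_tol=1.0, lost_ratio=0.5):
    """Classify the change between two frames from tracked features.

    prev, curr: dict feature_id -> (x, y). Features missing from curr
    count as lost, features missing from prev count as new.
    Returns "swap", "animation" or "static".
    """
    lost = [f for f in prev if f not in curr]
    new = [f for f in curr if f not in prev]
    moved = [f for f in prev if f in curr
             and math.hypot(curr[f][0] - prev[f][0],
                            curr[f][1] - prev[f][1]) > move_tol]
    if prev and len(lost) / len(prev) >= lost_ratio:
        return "swap"          # most features disappeared: slides exchanged
    if moved or new:
        return "animation"     # features move or appear: animation in progress
    return "static"            # nothing changed: same slide

# Hypothetical tracker output for one frame.
before = {1: (10, 10), 2: (50, 20), 3: (80, 90), 4: (30, 60)}
print(classify_transition(before, dict(before)))
print(classify_transition(before, {**before, 2: (55, 20)}))
print(classify_transition(before, {1: (10, 10), 9: (5, 5), 10: (70, 70)}))
```

A per-frame rule of this kind is only a starting point; the statistical evaluation mentioned above would aggregate such observations over several frames before committing to a decision.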
Once individual slides are detected, their content can be recognized. The text on the slides is important for indexing the recordings. OCR (Optical Character Recognition) techniques [MNY99] are applied for text recognition. It is possible to take advantage of the previous segmentation step. Further, a comparison with predefined patterns is performed. Other target objects can be recognized in the slides as well. For example, it is possible to find elementary geometrical objects, which are commonly used in block diagrams. The descriptions of the objects displayed on the slides can also be indexed and searched.
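How recognized text can serve indexing is illustrated by a small inverted index that maps words to the start times of the slides on which they appear. The sketch assumes the OCR step has already produced plain text for each slide; the input format, the sample texts and the function names are illustrative assumptions.

```python
import re
from collections import defaultdict

def build_index(slides):
    """Build a word -> list of slide start times index.

    slides: list of (start_seconds, recognized_text), an assumed format.
    """
    index = defaultdict(list)
    for start, text in slides:
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].append(start)
    return index

def search(index, query):
    """Return start times of slides containing every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)

# Made-up OCR output for three slides.
slides = [
    (0.0, "Edge Detection: Canny detector"),
    (65.0, "Hough transform"),
    (140.5, "Perspective correction and edge cases"),
]
index = build_index(slides)
print(search(index, "edge"))
```

The returned start times can be used directly as seek positions in the lecture recording.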
5 Conclusions and future directions
The goal of this report was to sketch out a system for the automatic detection and extraction of information from presentations contained in lecture recordings. The proposed techniques are independent of the presentation format. Furthermore, no additional time is required for preparing the recording. The resulting data can be used as supporting material for indexing. It can also be used for advanced video summarization. In addition, the system can be used to restore a presentation if the original is not available.
An experimental system for the verification of the proposed methods has been implemented. Some in-house tools, such as DigILib and AVFile, are used for the data processing. The projection screen localization and the correction of the perspective projection are working. The Intel Math Kernel Library is used for the computation of the projection transformation. The slide-swap detection module is fully implemented. The basic methods of swap identification and the recognition of partial animations and transition effects have been verified. Further development of the system will deal with the detection and capturing of the lecturer's actions related to the projected presentation.
6 References
| [Hea96] | Heath M., Sarkar S., Sanocki T., Bowyer K. Comparison of edge detectors: A methodology and initial study. In CVPR'96, 1996, p. 143-148. |
| [BB82] | Ballard D., Brown C. Computer Vision. Prentice-Hall, 1982. Chap. 4. |
| [Fau93] | Faugeras O. Three dimensional computer vision, a geometric viewpoint. MIT Press, 1993. |
| [SSM01] | Sukthankar R., Stockton R.G., Mullin M.D. Smarter Presentations: Exploiting Homography in Camera-Projector Systems. In Proceedings of International Conference on Computer Vision, 2001. |
| [BB95] | Beauchemin S.S., Barron J.L. The computation of optical flow. ACM Comput. Surv. 27(3), 1995, p. 433-466. |
| [Can86] | Canny J. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 1986. |
| [MNY99] | Mori S., Nishida H., Yamada H. Optical Character Recognition. Wiley-Interscience, 1999. |