Ogg Vorbis: Subjective assessment of sound quality at very low bit rates

CESNET technical report number 27/2006

Jakub Svítek
13.12.2006

1   Abstract

This technical report compares the sound quality of the open-source audio compression format Ogg Vorbis at very low bit rates (32 and 64 kbps) with that of three other widely used formats (Windows Media Audio, MPEG-4 AAC and MPEG-1 Layer 3). The first part briefly describes the Vorbis algorithm. The MUSHRA test method is then described, followed by the entire preparation process of the listening test. The results are presented in the last chapter.

Keywords: Ogg Vorbis, lossy audio compression, subjective assessment of sound quality, MUSHRA method.

2   Introduction

Ogg Vorbis is a fully open, non-proprietary, patent- and royalty-free compressed audio format. Its parameters place Vorbis in the same competitive class as WMA (Windows Media Audio), MPEG-4 AAC (Advanced Audio Coding) and the still very popular but less efficient MPEG-1 Layer 3 (also known as MP3). Vorbis is part of the Ogg family, a group of patent-free multimedia formats developed by the Xiph.Org Foundation. The bit stream format for Vorbis I was frozen on May 8th 2000. All bit streams encoded since then will remain compatible with all future releases of Vorbis.

[Figure]

Figure 1: Block diagram of the Vorbis I algorithm.

Figure 1 shows one possible implementation of a Vorbis I encoder. It is based on vector quantization (VQ) and a transform with overlapping windows, namely the Modified Discrete Cosine Transform (MDCT). Windows may have one of two specified lengths, 2048 or 512 samples. The shorter one is used only when coding a critical music signal with sudden changes in the time domain (such as percussion). After transformation to the frequency domain, the signal is analyzed by a psycho-acoustic model and the inaudible part of the spectrum is removed. Then a floor vector is generated for each channel. This vector is a low-resolution representation of the audio spectrum for the given channel in the current frame. The floor represents the curve as a piecewise linear interpolation on a dB amplitude scale and a linear frequency scale. The next step in the encoding process is subtraction of the floor curve from the audio spectrum, so that only the fine structure of the spectrum remains. This is called the residue. The residue vectors of both channels are transformed from Cartesian to polar representation, a process called channel coupling. Afterwards everything is coded by cascaded (multi-pass) vector quantization. The results (including the VQ code books) are then coded with the Huffman algorithm to eliminate even more redundancy. The final product of the entire process is the raw Vorbis packet. Finally, these packets are encapsulated in the universal Ogg container and the content is ready for distribution.
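The floor and residue steps described above can be illustrated with a minimal Python sketch. It is a simplified illustration only, not the actual Vorbis floor coding: the function names, the control points and the eight-bin spectrum are all hypothetical, and the control points are assumed to have distinct bin positions that span every bin of the spectrum.

```python
def piecewise_linear_floor(points, n_bins):
    """Interpolate (bin, level_db) control points into a floor curve.

    Assumes the points have distinct bin positions covering 0..n_bins-1.
    """
    points = sorted(points)
    floor = []
    seg = 0
    for k in range(n_bins):
        # Advance to the line segment that contains bin k.
        while seg < len(points) - 2 and k > points[seg + 1][0]:
            seg += 1
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        t = (k - x0) / (x1 - x0)
        floor.append(y0 + t * (y1 - y0))
    return floor


def residue(spectrum_db, floor_db):
    """Subtract the floor from the spectrum; both are given in dB."""
    return [s - f for s, f in zip(spectrum_db, floor_db)]


# Hypothetical eight-bin magnitude spectrum in dB and three control points.
spectrum_db = [-20.0, -22.0, -30.0, -33.0, -41.0, -48.0, -52.0, -60.0]
control_points = [(0, -20.0), (4, -40.0), (7, -61.0)]

floor_db = piecewise_linear_floor(control_points, len(spectrum_db))
residue_db = residue(spectrum_db, floor_db)
```

The floor follows the coarse spectral envelope, so the residue stays within a few dB of zero and carries only the fine structure that is passed on to vector quantization.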

Decoding Vorbis requires fewer CPU resources than MPEG-like formats but has higher memory demands. The decoder does not include any fixed code books (neither for VQ nor for Huffman coding). They are specific to every data stream and are included in the header at its beginning. The code books have to be kept in memory for the entire decoding time. More information about the Ogg Vorbis algorithm can be found in [Vor00].

3   Test methodology

Subjective assessment of sound quality is the best way to evaluate encoder performance. Heavy compression always causes some kind of impairment. The main objective of every compression algorithm is to make this impairment as tolerable to the human ear as possible. There are several known methods for subjective assessment of audio codec quality (see [Sv05]).

3.1   Selection of test method

I found the EBU MUSHRA method the most suitable for testing compressed audio samples at very low bit rates (see [Ebu00]). MUSHRA is a double-blind MUlti Stimulus test method with Hidden Reference and hidden Anchors. Each test set always contains one known high-quality reference, one hidden reference, at least one anchor (an intentionally impaired signal) and one sample for each audio system under test. The method is intended for evaluating medium and large impairments by comparing unknown samples with a high-quality reference, and also by direct paired comparison between the samples. The assessors can switch at will between the reference signal and any of the systems under test for as long as they need to make their decision. They are then asked to judge their degree of preference for one type of artefact versus another. Listeners express their opinion by placing a cross-mark on a continuous grading scale with five anchor points. Each of these points represents a different level of sound quality (see Figure 2).
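As a rough illustration of how a cross-mark on the paper scale could be turned into a number, the following Python sketch converts a mark position into a score and a quality label. It rests on assumptions that are not stated in this report: a 0-100 scale split into five equal bands, as is common in MUSHRA practice, with "Bad" as the lowest label. The function names and the scale length in millimetres are hypothetical.

```python
# Assumed labels for five equal 20-point bands on a 0-100 scale.
QUALITY_LABELS = ("Bad", "Poor", "Fair", "Good", "Excellent")


def mark_to_score(position_mm, scale_length_mm):
    """Convert the position of a cross-mark on the printed scale to 0-100."""
    if not 0 <= position_mm <= scale_length_mm:
        raise ValueError("mark lies outside the scale")
    return 100.0 * position_mm / scale_length_mm


def quality_label(score):
    """Return the label of the 20-point band that contains the score."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    band = min(int(score // 20), len(QUALITY_LABELS) - 1)
    return QUALITY_LABELS[band]


# A mark 75 mm along a hypothetical 100 mm scale.
score = mark_to_score(75, 100)
label = quality_label(score)
```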

[Figure]

Figure 2: The grading scale.

3.2   Selection of audio encoders

Selection of compression formats and their specific encoders is a key part of the test preparation phase. Because tests of this type are extremely time-consuming, I decided to compare Ogg Vorbis with only the three most widespread audio formats: MP3, AAC and WMA. I picked the following encoders:

3.3   Selection of test materials

The choice of test material is crucial to the success of the tests. It is important to choose audio signals with some critical elements (such as percussion) to make sure that assessors are able to identify the compressed sample at all. Another important aspect is the artistic (intellectual) content of the material. It should be neither very attractive nor annoying, because it could then divert the listeners' attention. To keep the length of an entire test session reasonable, I decided to pick only five musical sequences, while still trying to preserve the diversity of the material. Here is a brief list of the chosen music samples:

3.4   Compression of test materials

First of all I had to choose a compression mode supported by all four encoders to provide the same conditions. My final choice was one-pass VBR (Variable Bit Rate). Then I had to select the target bit rates. Because this test was intended to compare codecs at low bit rates, I chose 32 and 64 kbps (kilobits per second). In all cases (except AAC) I used the command-line interfaces of the encoders to get more control over the entire encoding process. To compress the test samples into the AAC format I used iTunes on Mac OS X. Afterwards I had to prepare the anchors. I did this by applying a low-pass filter (3.5 kHz) to the reference samples. The signals were then impaired enough to serve as good anchors. The result of this part of the preparation phase was a set of 60 audio files (2 bit rates x 5 music samples x (reference + anchor + 4 codecs)).
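The count of 60 files can be checked by enumerating the combinations, as in the short Python sketch below. The sample names and the file naming scheme are hypothetical and serve only to make the arithmetic concrete.

```python
from itertools import product

BIT_RATES_KBPS = (32, 64)
# Hypothetical placeholder names for the five music samples.
SAMPLES = tuple(f"sample{i}" for i in range(1, 6))
# Reference, anchor and the four codecs under test.
VARIANTS = ("reference", "anchor", "vorbis", "wma", "aac", "mp3")


def build_file_set():
    """List one (hypothetical) file name per bit rate, sample and variant."""
    return [
        f"{sample}_{rate}k_{variant}.wav"
        for rate, sample, variant in product(BIT_RATES_KBPS, SAMPLES, VARIANTS)
    ]


files = build_file_set()
file_count = len(files)  # 2 bit rates x 5 samples x 6 variants
```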

3.5   The test interface

To give listeners maximum control over the test process I created a test application with a graphical interface (see Figure 3). To do this I used Macromedia Flash MX Professional 2004 Trial. Before inserting the samples into the application I converted them back to WAV files to avoid possible decoding problems during the tests. The impairment caused by the previous compression was of course still present.

[Figure]

Figure 3: The test interface.

The test interface is very intuitive to use. The listener is guided through the entire session according to the MUSHRA methodology. The graphical interface consists of three main parts: buttons with samples, a field with instructions and a navigation part. The first part contains the buttons A-F with randomly assigned samples and the REF button, which always plays the known high-quality reference. The second part contains the text with instructions for assessors. The last part of the GUI is used to proceed to the next set of samples. It also shows the current position in the test session.
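The random assignment of samples to the buttons A-F can be sketched in Python as follows. This is not the code of the Flash application, which is not reproduced in this report. The stimulus names and the seed are hypothetical.

```python
import random

BUTTONS = "ABCDEF"
# Six stimuli per test set: hidden reference, anchor and four codecs.
STIMULI = ("hidden reference", "anchor", "vorbis", "wma", "aac", "mp3")


def assign_buttons(stimuli, rng):
    """Return a mapping from button letter to a randomly chosen stimulus."""
    if len(stimuli) != len(BUTTONS):
        raise ValueError("one stimulus per button is required")
    shuffled = list(stimuli)
    rng.shuffle(shuffled)
    return dict(zip(BUTTONS, shuffled))


# A seeded generator makes the assignment reproducible for later scoring.
assignment = assign_buttons(STIMULI, random.Random(2006))
```

Recording the assignment for each test set is what allows the marks on the paper form to be mapped back to the codecs afterwards.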

3.6   Technical realization of the test

The test application ran in full-screen mode on a very quiet laptop, an Apple PowerBook G4 12". Although the PowerBook has quite good sound output, it was not good enough for this purpose. I therefore decided to use a professional external USB sound card that is part of the multipurpose device M-Audio Ozone. It has studio-quality A/D and D/A converters (24 bit / 96 kHz) and a very good signal-to-noise ratio (105 dB). The most important part of the reproduction chain, however, was the headphones, so I chose the professional studio headphones Beyerdynamic DT 770. Thanks to their closed construction, common ambient noise was in most cases almost entirely eliminated, so the demands on the testing environment were lower. It was therefore not necessary to hold the test in a studio with a very low noise level. An ordinary room without any source of noise was good enough. Figure 4 shows the technical setup of the listening test.

[Figure]

Figure 4: Technical setup of the test.

3.7   Organization of the listening test

Due to the technical setup only one person could be tested at a time. Every assessor received a paper form with the appropriate scales (see Figure 2) on which to mark the evaluations. The listening test had two phases. The first was the training phase. Its main purpose was to familiarize the assessor with the interface, the testing method and the typical artifacts caused by compression of the audio signal. The assessor had a chance to try the evaluation on two training sets of samples. The second phase was the listening test itself. It consisted of eight sets. In each set all buttons played the same music sequence at the same bit rate. The only difference between the samples was the codec used. Each test session was about twenty minutes long (depending on the listener's decisiveness).

4   Test results

The main goal of this experiment was to compare the sound quality of Ogg Vorbis with other widely used audio codecs at very low bit rates. A total of 27 persons took this subjective test. Because of the very high organizational demands the entire testing lasted almost two months. The listeners were mostly between 18 and 26 years of age, so it was highly probable that they still had good hearing. Six of them had previous experience with similar listening tests.

[Figure]

Figure 5: Test results for the individual samples.

Figure 5 shows the test results for each of the samples separately. As we can see, at 32 kbps Ogg Vorbis is the best assessed codec for all samples. The most significant quality difference in comparison with the other codecs occurs for a very complex signal with a wide spectrum (Sample 2 - Tool), where for example MP3 collapsed completely. At 64 kbps the results are much more balanced and OGG is not always the best.

[Figure]

Figure 6: Overall test results - numeric representation.

Figure 6 gives a numerical representation of the overall test results. At 64 kbps there are no significant quality differences. The quality of most codecs was evaluated as somewhere between "Excellent" and "Good". Only MP3 was rated as somewhere between "Good" and "Fair". The situation at 32 kbps is quite different. The quality of MP3 and AAC fell almost to "Poor", and that of WMA to "Fair". It is very interesting that OGG has a higher rating at 32 kbps than at 64 kbps. I think this is because the quality of the other samples was so poor in comparison with OGG that listeners were much more generous when awarding "points".
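This report does not state how the overall ratings were computed. A common way to summarize MUSHRA scores is the mean with a 95% confidence interval, which the following Python sketch illustrates. The scores are invented for illustration and are not data from this test. The function name is hypothetical, and the t-value must be chosen for the actual number of listeners.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(scores, t_value):
    """Return the mean score and the half-width of its confidence interval.

    t_value is the two-sided Student t quantile for len(scores) - 1
    degrees of freedom.
    """
    n = len(scores)
    half_width = t_value * stdev(scores) / sqrt(n)
    return mean(scores), half_width


# Made-up scores from five hypothetical listeners (not data from this test).
example_scores = [60, 70, 80, 70, 70]
# 2.776 is the 95% two-sided t quantile for 4 degrees of freedom.
avg, half_width = mean_with_ci(example_scores, t_value=2.776)
```

With 27 listeners the corresponding quantile for 26 degrees of freedom is about 2.056. Two codecs whose intervals overlap substantially cannot be reliably ranked, which is one way to read the statement that there are no significant differences at 64 kbps.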

[Figure]

Figure 7: Overall test results - graphical representation.

Figure 7 shows a graphical representation of the overall test results. As the results above show, Ogg Vorbis is an excellent choice when the finest sound quality is needed at very low bit rates (for example for streaming internet radio).

References

[Vor00] Xiph.Org Foundation: Vorbis I Specification. Available online.
[Ebu00] Stoll, G., Kozamernik, F.: EBU listening tests on Internet audio codecs. EBU Technical Review, June 2000. Available online.
[Sv05] Svítek, J.: Metody subjektivniho testovani kompresoru zvuku. Master thesis, Dept. of Radioelectronics, FEE CTU, Prague, June 2005.