Automated Objective and Subjective Evaluation of HTTP Adaptive Streaming Systems
Abstract—Streaming audio and video content currently accounts for the majority of internet traffic and is typically deployed over the top of the existing infrastructure. We face a plethora of media players and adaptation algorithms showing different behavior, but we lack a common framework for the objective and subjective evaluation of such systems. This paper aims to close this gap by (i) proposing such a framework, (ii) describing its architecture, (iii) providing an example evaluation, and (iv) discussing open issues.
Keywords-HTTP Adaptive Streaming; objective evaluation; subjective evaluation; MPEG-DASH
I. INTRODUCTION
Adaptive bitrate (ABR) streaming or HTTP-based adaptive streaming (HAS) over the open, unmanaged internet has become the de-facto standard for the delivery of audiovisual content. Currently, the two most popular formats are (1) Dynamic Adaptive Streaming over HTTP (DASH) [1] and (2) HTTP Live Streaming (HLS) [2], which may converge to a single format in the near future thanks to the standardization of the Common Media Application Format (CMAF) [3]. Independent of the representation format, the media is provided in multiple versions (e.g., different resolutions and bitrates) and each version is divided into chunks of a few seconds (typically 2-10 s). A client first receives a manifest describing the available content on a server, and then requests chunks based on its context (e.g., observed available bandwidth, buffer status, and decoding capabilities). Thus, it is able to adapt the media presentation in a dynamic, adaptive way. In DASH, the chunks are referred to as segments and the manifest is called a media presentation description (MPD). In this paper, we use the terminology of DASH; however, this work can also be applied to any other format sharing the same principles.
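The client-side principle described above can be illustrated with a minimal, hedged sketch of throughput-based representation selection; the bitrate ladder, the safety margin, and the helper functions are illustrative assumptions and do not correspond to any particular player discussed later.

```python
# Minimal sketch of throughput-based representation selection in a HAS client.
# The bitrate ladder, safety margin, and helpers are illustrative assumptions;
# real players additionally consider buffer level, device capabilities, etc.

BITRATES_KBPS = [400, 800, 1800, 2400, 4500, 8000]  # hypothetical bitrate ladder

def select_representation(throughput_kbps: float, safety_margin: float = 0.8) -> int:
    """Pick the highest bitrate that fits within a fraction of the measured throughput."""
    budget = throughput_kbps * safety_margin
    feasible = [b for b in BITRATES_KBPS if b <= budget]
    return feasible[-1] if feasible else BITRATES_KBPS[0]

def streaming_loop(num_segments, estimate_throughput, download_segment):
    """Request segments one by one, re-selecting the representation per segment."""
    for index in range(num_segments):
        bitrate = select_representation(estimate_throughput())
        download_segment(index, bitrate)
```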
In the past, we witnessed a plethora of research papers in this area (see the surveys in [4], [5]); however, we still lack a comprehensive evaluation framework for HAS systems in terms of both objective – i.e., Quality of Service (QoS) – and subjective – i.e., Quality of Experience (QoE) – metrics. Initial evaluations have been based on simple traffic shaping and network emulation tools [6] or on means to rapidly prototype adaptation algorithms [7]. Recently, we have seen various evaluation frameworks in this domain focusing on adaptation algorithms proposed both in academia and industry [8]–[10]. However, the main focus has been on QoS rather than QoE. The latter typically requires user studies, which are mainly conducted within controlled laboratory environments. Yet, nowadays crowdsourcing is also considered a reliable tool [11] and various platforms have been proposed [12] for this purpose.
In this paper, we propose a flexible and comprehensive framework to conduct objective and subjective evaluations of HAS systems in a fully automated and scalable way. It provides the following features:
End-to-end HAS evaluation of players deployed in industry and algorithms proposed in academia under various conditions and use cases (e.g., codecs/representations, network configurations, end user devices, player competition, etc.).
Collection and analysis of objective streaming performance metrics (e.g., startup time, stalls, quality switches, average bitrate).
Subjective quality assessment utilizing crowdsourcing for QoE evaluation of HAS systems and QoE model testing/verification (e.g., test or verify a proposed QoE model using subjective user studies).
The remainder of this paper is organized as follows. Section II comprises a detailed description of the architecture of the proposed framework. Section III presents example evaluation results to demonstrate the capabilities of the framework. A discussion and open research issues are provided in Section IV, and Section V concludes the paper.
II. FRAMEWORK ARCHITECTURE
Our framework – as depicted in Figure 1 – supports both objective and subjective evaluation of HAS systems and is composed of AdViSE [13] and WESP [14] plus extensions.
AdViSE [13] is an adaptive video streaming evaluation framework for the automated testing of web-based media players and adaptation algorithms. It has been designed in an extensible way to support (i) different adaptive media content formats (e.g., DASH, HLS, CMAF), (ii) commercially deployed media players as well as implementations of adaptation algorithms proposed in the research literature, and (iii) various networking parameters (e.g., bandwidth, delay) through network emulation. The output of AdViSE comprises a set of QoS and (objective) QoE metrics gathered and calculated during the adaptive streaming evaluation as well as a log of segment requests, which are used to generate the impaired media sequences used for the subjective evaluation.
Figure 1: General framework architecture: Adaptive Video Streaming Evaluation framework for the automated testing of web-based media players and adaptation algorithms (AdViSE) & WEb-based Subjective evaluation Platform (WESP).
The subjective evaluation is based on WESP [14], which is a web-based subjective evaluation platform using existing crowdsourcing platforms for subject recruitment and implementing best practices according to [11]. WESP takes the impaired media sequences as an input and allows for a flexible configuration of various QoE evaluation parameters such as (i) typical questionnaire assets (e.g., drop-down menus, radio buttons, free text fields), (ii) the subjective quality assessment methodology based on ITU recommendations (e.g., absolute category rating), and (iii) different crowdsourcing platforms (e.g., Microworkers, Mechanical Turk). The output of WESP comprises the subjective results, e.g., mean opinion scores (MOS), and any other data gathered during the subjective quality assessment. Together with the outputs of AdViSE, it is used to generate fully automated reports and data export functions, which are eventually used for further analysis.
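To give an impression of the kind of configuration WESP exposes, the following sketch models an SQA setup as plain data; the field names and values are our own illustration under stated assumptions and do not reflect WESP's actual configuration syntax.

```python
# Illustrative SQA configuration (not WESP's actual syntax or field names).
sqa_config = {
    "method": "ACR",                       # absolute category rating (single stimulus)
    "rating_scale": [1, 2, 3, 4, 5],       # from "bad" to "excellent"
    "crowdsourcing_platform": "Microworkers",
    "stimuli": ["playerA_run1.mp4", "playerB_run1.mp4"],  # impaired sequences from AdViSE
    "questionnaire": {
        "pre": ["age_group", "viewing_device"],
        "control_questions": ["Which object appeared in the last clip?"],
    },
    "tracking": ["window_focus", "per_stimulus_viewing_time", "voting_time"],
}
```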
In the following we provide a detailed description of AdViSE and WESP focusing on how they connect with each other leading to a fully automated objective and subjective evaluation of HAS systems. Further details about the individual building blocks can be found in [8], [9], [13], [14].
A. AdViSE: Adaptive Video Streaming Evaluation
AdViSE includes the following components as depicted in Figure 2:
Web server with standard HTTP hosting the media content and a MySQL database.
Figure 2: Architecture of adaptive video streaming evaluation framework for the automated testing of media players and adaptation algorithms.
Network emulation server with a customized Mininet environment for, e.g., bandwidth shaping (a minimal emulation sketch is shown after this list).
Selenium servers for running adaptive media players/algorithms on various platforms. Note that there might be multiple physical servers, each hosting a limited set of players/algorithms.
Web management interface for (i) conducting the experiments and (ii) running the adaptive media players.
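To illustrate how the network emulation and Selenium components interact, the following hedged sketch shapes a bottleneck link with Mininet and drives a web-based player through a remote Selenium session. Host names, the Selenium hub URL, the player page, and the shaping values are placeholders; AdViSE's actual orchestration (and the plumbing needed to route browser traffic through the emulated link) may differ.

```python
# Sketch only: emulate a bottleneck link with Mininet and drive a web player via Selenium.
# Hostnames, URLs, and shaping values are placeholders, not AdViSE's actual setup.
from mininet.net import Mininet
from mininet.link import TCLink
from selenium import webdriver

def run_experiment():
    net = Mininet(link=TCLink)
    client = net.addHost("client")
    server = net.addHost("server")
    # Shape the bottleneck link: 2.5 Mbps, 70 ms delay, no loss.
    net.addLink(client, server, bw=2.5, delay="70ms", loss=0)
    net.start()

    # Attach to a Selenium server hosting the player under test (placeholder URLs).
    driver = webdriver.Remote(
        command_executor="http://selenium-host:4444/wd/hub",
        options=webdriver.ChromeOptions(),
    )
    driver.get("http://server/player.html?mpd=bbb.mpd")
    # ... poll the player API for metrics during playback, then tear down ...
    driver.quit()
    net.stop()
```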
AdViSE defines a flexible system that allows adding new adaptive media players/algorithms relatively fast. The web management interface provides two functions: (i) configuring and conducting the experiments, and (ii) embedding the actual player/algorithm to provide real-time information about the currently running experiment. Thus, the framework proposed in this paper provides the means for a comprehensive end-to-end evaluation of adaptive streaming services over HTTP, including the possibility of subjective quality testing. The interface allows defining the following items and parameters: (1) configuration of network emulation profiles including the bandwidth trajectory, packet loss, and packet delay; (2) specification of the number of runs of an experiment; and (3) selection of one or more adaptive HTML5 players (or adaptation algorithms) and the adaptive streaming format used (e.g., DASH, HLS, CMAF).
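A hypothetical experiment definition combining these three items could look as follows; the field names are ours, chosen for illustration, and do not reflect AdViSE's web interface.

```python
# Illustrative experiment definition (field names do not reflect AdViSE's interface).
experiment = {
    "format": "DASH",
    "players": ["dash.js", "Bitmovin", "FESTIVE"],
    "runs": 5,
    "network_profile": {
        # (bandwidth in Kbps, duration in seconds); cf. the trajectory used in Section III
        "bandwidth_trajectory": [(750, 65), (350, 90), (2500, 120), (500, 90)],
        "delay_ms": 70,
        "packet_loss_pct": 0.0,
    },
}
```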
The result page provides a list of conducted experiments and the analytics section contains various metrics of the conducted experiments. It is possible to generate graphs for the results by using Highcharts and to export the raw values for further offline analysis. The following quality parameters and metrics are currently available: (i) startup time; (ii) stalls (or buffer underruns); (iii) number of quality switches; (iv) download bitrate; (v) buffer length; (vi) average bitrate; (vii) instability and inefficiency; and (viii) simple QoE models specially designed for HAS. Further metrics can be easily added based on what the application programming interfaces (APIs) of players actually offer, and as new metrics or QoE models become available.
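Some of these metrics can be derived directly from the per-segment log. The sketch below shows one straightforward way to obtain the number of quality switches and the average bitrate; it is a simplified illustration, and the exact definitions used by AdViSE (e.g., for instability and inefficiency) follow the respective literature and may differ.

```python
# Sketch: derive simple metrics from a per-segment bitrate log (values in Kbps).
def quality_switches(bitrate_log):
    """Count segment-to-segment representation changes."""
    return sum(1 for prev, cur in zip(bitrate_log, bitrate_log[1:]) if prev != cur)

def average_bitrate(bitrate_log):
    """Mean bitrate over all downloaded segments."""
    return sum(bitrate_log) / len(bitrate_log) if bitrate_log else 0.0

# Example: a player that ramps up, drops once, and recovers.
log = [800, 1800, 2400, 1200, 2400, 2400]
print(quality_switches(log), average_bitrate(log))  # 4 switches, ~1833 Kbps on average
```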
Finally, AdViSE provides the log of the segment requests, which are used – together with metrics such as startup time and stalls – to generate a media sequence as received by the player, and consequently, perceived by the user. The request log is used to concatenate the segments according to the request schedule of the player, thus, reflecting the media bitrate and quality switches. Other impairments such as startup time or stalls are automatically inserted based on the corresponding metrics gathered during the evaluation and by using predefined templates (e.g., stalls displayed as spinning wheel). This impaired media sequence is used in the subsequent step for the subjective QoE evaluation using WESP, which could also include the unimpaired media presentation depending on the employed evaluation method.
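Conceptually, this is a two-step process: concatenate the segment renditions in the order they were requested and splice in templates for startup delay and stalls. The hedged sketch below uses ffmpeg's concat demuxer as one possible realization; file names, the request-log structure, and the spinner template are placeholders, and AdViSE's actual implementation may differ.

```python
# Sketch: build an impaired sequence from a segment request log using ffmpeg.
# File names, log structure, and the stall/spinner template are placeholders;
# "-c copy" further assumes all inputs share identical codec parameters.
import subprocess

def build_impaired_sequence(request_log, stall_template="spinner_2s.mp4", out="impaired.mp4"):
    """request_log: list of dicts such as {"segment": "seg_12_1800kbps.mp4", "stall_s": 0}."""
    with open("concat.txt", "w") as f:
        for entry in request_log:
            f.write(f"file '{entry['segment']}'\n")
            if entry.get("stall_s", 0) > 0:
                f.write(f"file '{stall_template}'\n")  # stall shown as a spinning-wheel clip
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "concat.txt", "-c", "copy", out],
        check=True,
    )
```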
Summary: AdViSE provides scalable, end-to-end HAS evaluation through emulation with a plethora of configuration possibilities regarding content configuration, players/algorithms (including player competition scenarios), and network parameters. With AdViSE, it is possible to use actual content and network settings with actual dynamic, adaptive streaming including rendering. We collect various metrics from players based on their API (i.e., when access to the source code is restricted) or from the algorithms/HTML5 directly. Additionally, we implemented so-called derived metrics and utilize QoE models proposed in the literature. Finally, the segment request log is used to generate the impaired media sequence as perceived by end users for subjective quality testing.
Figure 3: Architecture of web-based subjective evaluation platform.
B. WESP: Web-Based Subjective Evaluation Platform
Subjective quality assessments (SQAs) are a vital tool for evaluating QoE. SQAs provide reliable results, but they are considered cost-intensive and are typically conducted within controlled laboratory environments. Crowdsourcing has been proposed as an alternative to reduce the cost; however, various aspects need to be considered in order to obtain similarly reliable results [11]. In the past, several frameworks have been proposed leveraging crowdsourcing platforms to conduct SQAs, each providing different features [12]. However, a common shortcoming of these frameworks is that they require tedious configuration and setup for each SQA, which makes them difficult to use. Therefore, we propose to use a web-based management platform, which shall (i) enable easy and simple configuration of SQAs including the possible integration of third-party tools for online surveys, (ii) provide means to conduct SQAs using existing crowdsourcing platforms taking into account best practices as discussed in [11], and (iii) allow for the analysis of the results.
The goal of WESP is not only to provide a framework that fulfills the recommendations of the ITU for subjective evaluations of multimedia applications (e.g., BT.500, P.910, P.911) but also to provide the possibility to select and configure the preferred evaluation method via a web interface. The conceptual WESP architecture, shown in Figure 3, is implemented using HTML/PHP with a MySQL database.
The introduction and questionnaires can be configured separately from the test methodology and may include control questions during the main evaluation. The voting possibility can be configured independently from the test methodology, providing more flexibility in selecting the appropriate voting mechanism and rating scale. The predefined voting mechanisms include the common HTML interface elements and some custom controls like a slider in different variations. The platform consists of a management layer and a presentation layer. The management layer allows for maintaining the user study, such as adding new questions or multimedia content and setting up the test method to be used (e.g., single stimulus, double stimulus, pair comparison, continuous quality evaluation, etc.). The presentation layer is responsible for presenting the content to the participants. This allows providing different views on the user study, and thus, one can define groups to which the participants may be randomly (or in a predefined way) assigned. After a participant finishes the user study, the gathered data is stored in a MySQL database. Furthermore, the platform offers methods for tracking the participant's behavior during an SQA (e.g., focus of the web browser's window/tab, time for consuming each stimulus presentation, time it takes for the voting phase, etc.) and data provided by the web player API.
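The tracked behavior and control questions can subsequently be used to filter out unreliable crowdsourcing submissions before computing MOS values. The following is a minimal sketch of such a post-hoc filter; the thresholds and field names are illustrative assumptions, not WESP's actual rules.

```python
# Sketch: filter crowdsourcing submissions using tracked behavior and control questions.
# Thresholds and field names are illustrative assumptions only.
def is_reliable(submission, min_viewing_ratio=0.9, max_focus_losses=2):
    """Accept a submission only if the worker watched the stimulus and passed the checks."""
    if not submission["control_questions_passed"]:
        return False
    if submission["focus_losses"] > max_focus_losses:
        return False
    viewed = submission["viewing_time_s"] / submission["stimulus_duration_s"]
    return viewed >= min_viewing_ratio

def mean_opinion_score(submissions):
    """MOS over reliable submissions, or None if none remain."""
    ratings = [s["rating"] for s in submissions if is_reliable(s)]
    return sum(ratings) / len(ratings) if ratings else None
```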
Figure 4: Download video bitrate for the selected industry players (left) and adaptation algorithms proposed in the research literature (right).
The stimuli presentation can be configured independently from the test method and may be combined with the voting possibility to support continuous quality evaluations. The media content can be fully downloaded and cached on the evaluation device prior to starting the actual media presentation to avoid glitches during the evaluation, e.g., due to network issues. However, the platform also supports streaming evaluations in real-world environments where various metrics (e.g., startup time, stalls) are collected and stored for analysis.
Summary: WESP provides an extensible, web-based QoE evaluation platform utilizing crowdsourcing. It supports a plethora of evaluation methodologies and configuration possibilities. Although it has been specifically designed to implement SQAs for HAS systems using crowdsourcing (including support for real-world environments), it can also be used for SQAs within laboratory environments.
III. EXAMPLE EVALUATION RESULTS
In this section, we provide example evaluation results of selected industry players and adaptation algorithms proposed in the research literature: Bitmovin v7.0, dash.js v2.4.0, Flowplayer v6.0.5, FESTIVE [15], Instant [16], and Thang [17]. Note that we show only a small selection – also due to page limits – and the results presented here should only be seen as an example of what the framework provides rather than a full-fledged player comparison sheet. Further results obtained using the tools described in this paper can be found in [8], [9], [18].
For the evaluation, we used the Big Buck Bunny sequence and encoded it according to the Amazon Prime video service, which offers 15 different representations as follows: 400x224 (100 Kbps), 400x224 (150), 512x288 (200), 512x288 (300), 512x288 (500), 640x360 (800), 704x396 (1200), 704x396 (1800), 720x404 (2400), 720x404 (2500), 960x540 (2995), 1280x720 (3000), 1280x720 (4500), 1920x1080 (8000), and 1920x1080 (15000). The segment length is 4 s and one audio representation at 128 Kbps is used. We adopt the bandwidth trajectory from [6], providing both step-wise and abrupt changes in the available bandwidth, i.e., 750 Kbps (65 seconds), 350 Kbps (90), 2500 Kbps (120), 500 Kbps (90), 700 Kbps (30), 1500 Kbps (30), 2500 Kbps (30), 3500 Kbps (30), 2000 Kbps (30), 1000 Kbps (30), and 500 Kbps (85). The network delay is set to 70 milliseconds.
Figure 4 shows the download bitrate for the players and algorithms in question, and Table I provides an overview of all metrics. Metrics a.–e. are directly retrieved from the player/HTML5 API and the algorithm implementations, respectively. Metrics f.–g. utilize simple QoE models [19], [20] to calculate MOS values ranging from one to five based on a subset of the other metrics. Interestingly, industry players and research algorithms show different performance behavior under the same conditions but can be directly compared with each other. A detailed analysis of the results is out of the scope of this paper.
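For illustration, the sketch below shows a generic model of this family: a bitrate-driven base score penalized by startup delay, stalls, and quality switches, clipped to the MOS range. The coefficients are arbitrary placeholders and do not reproduce the models from [19], [20].

```python
# Sketch of a generic HAS QoE model: base quality from relative bitrate,
# penalties for startup delay, stalls, and switches. Coefficients are placeholders,
# NOT the models from [19], [20].
def simple_qoe_mos(avg_bitrate_kbps, max_bitrate_kbps, startup_s,
                   num_stalls, total_stall_s, num_switches):
    base = 1.0 + 4.0 * (avg_bitrate_kbps / max_bitrate_kbps)  # maps to the 1..5 MOS range
    penalty = 0.05 * startup_s + 0.5 * num_stalls + 0.1 * total_stall_s + 0.02 * num_switches
    return max(1.0, min(5.0, base - penalty))

# Example: 4500 Kbps average (max 8000), 1 s startup, no stalls, 5 switches -> 3.1
print(round(simple_qoe_mos(4500, 8000, 1.0, 0, 0.0, 5), 2))
```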
IV. DISCUSSION AND CHALLENGES
In this section, we provide a discussion of our framework for the automated objective and subjective evaluation of HAS systems. It allows for an easy setup of various configurations and for running multiple evaluations in parallel. New players and algorithms can be added easily as they appear on the market and in the research literature. Over time, it is possible to build up a repository of players and algorithms for comprehensive performance evaluation. As it is possible to run multiple Selenium servers in parallel, our framework is capable of evaluating scenarios in which players/algorithms compete for bandwidth in various configurations (e.g., n instances of player A vs. m instances of player B).
The framework is quite flexible, and thus, comes with a high number of degrees of freedom. Hence, it is important to design the evaluation carefully. Here we provide a brief list of the aspects to consider: (A) Content assets: content type, codec/coding parameters (including HDR, WCG), representations (bitrate/resolution pairs, also referred to as the bitrate ladder), segment length (including GOP size), representation format (i.e., DASH, HLS, CMAF), etc.; (B) Network parameters: bandwidth trajectory (i.e., predefined, network traces), delay, loss, and other networking aspects (see below for further details); (C) End user device environment: device type, operating system, browser, etc.; (D) Streaming performance metrics: average bitrate, startup time, stalls (frequency, duration), quality switches (frequency, amplitude), etc.; (E) Quantitative QoE models based on audio-video quality and/or streaming performance metrics; (F) General HAS evaluation setup: live vs. on-demand content, single player vs. multiple players competing for bandwidth, etc.; (G) Templates for generating the impaired media sequence (i.e., how to realize startup delay and stalls); (H) Questionnaire for the SQA including control questions for crowdsourcing; (I) SQA method (e.g., single stimulus, double stimulus, pair-wise comparison) and its parametrization; and (J) Collection of all results and further (offline) analysis.
All these aspects are important to consider and a potential source of risk when conducting such experiments. A detailed analysis is beyond the scope of this paper.
Based on our experience from conducting multiple evaluations and performance comparisons, we identified the following research challenges, possibly subject to future work: (I) The reliability of results requires cross-validation, specifically for those from SQAs, which typically calls for SQAs in controlled laboratory environments. (II) The network is a key aspect within HAS systems but is often neglected. Network emulation is a vital tool but has limitations. For HAS systems, we also need to consider content distribution networks (CDNs), software-defined networking (SDN), information-centric networking (ICN), and next-generation (mobile) networks (e.g., 5G). Detailed analyses and evaluations of these aspects in the context of HAS are currently missing. However, for example, recent standardization and research contributions showed benefits for HAS systems when combined with SDN [21]. (III) Reproducibility of such a framework can be achieved by providing containerized versions of the modules as done in [10]. This is considered critical when using industry players, which often require licenses. Additionally, it could be interesting to connect to large-scale research networks (such as PlanetLab, the Virtual Internet Routing Lab, or GENI).
V. CONCLUSIONS
This paper describes how AdViSE and WESP can be combined to perform objective and subjective evaluations of HAS systems in a fully automated and scalable way. For example, the framework can be used to test and compare new players/algorithms under various context conditions or to research new QoE models with practically instant verification through subjective tests. The main finding of this work is that a comprehensive objective and subjective evaluation of HAS systems is feasible for both industry players and adaptation algorithms proposed in the research literature. Hence, we recommend adopting it when proposing new features in this area and evaluating these features against the state of the art.