Automated Objective and Subjective Evaluation of HTTP Adaptive Streaming Systems
Abstract—Streaming audio and video content currently accounts for the majority of internet traffic and is typically deployed over the top of the existing infrastructure. We face a plethora of media players and adaptation algorithms showing different behavior, but we lack a common framework for the objective and subjective evaluation of such systems. This paper aims to close this gap by (i) proposing such a framework, (ii) describing its architecture, (iii) providing an example evaluation, and (iv) discussing open issues.
Keywords-HTTP Adaptive Streaming; objective evaluation; subjective evaluation; MPEG-DASH
I. INTRODUCTION
Adaptive bitrate (ABR) streaming or HTTP-based adaptive streaming (HAS) over the open, unmanaged internet has become the de-facto standard for the delivery of audiovisual content. Currently, the two most popular formats are (1) Dynamic Adaptive Streaming over HTTP (DASH) [1] and (2) HTTP Live Streaming (HLS) [2], which may converge to a single format in the near future thanks to the standardization of the Common Media Application Format (CMAF) [3]. Independent of the representation format, the media is provided in multiple versions (e.g., different resolutions and bitrates) and each version is divided into chunks of a few seconds (typically 2-10 s). A client first receives a manifest describing the available content on a server, and then requests chunks based on its context (e.g., observed available bandwidth, buffer status, and decoding capabilities). Thus, it is able to adapt the media presentation in a dynamic, adaptive way. In DASH, the chunks are referred to as segments and the manifest is called a media presentation description (MPD). In this paper, we use the terminology of DASH; however, this work can also be applied to any other format sharing the same principles.
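The client-side principle described above can be illustrated with a minimal, hedged sketch of throughput-based representation selection; the bitrate ladder, the safety margin, and the helper functions are illustrative assumptions and do not correspond to any particular player discussed later.

```python
# Minimal sketch of throughput-based representation selection in a HAS client.
# The bitrate ladder, safety margin, and helpers are illustrative assumptions;
# real players additionally consider buffer level, device capabilities, etc.

BITRATES_KBPS = [400, 800, 1800, 2400, 4500, 8000]  # hypothetical bitrate ladder

def select_representation(throughput_kbps: float, safety_margin: float = 0.8) -> int:
    """Pick the highest bitrate that fits within a fraction of the measured throughput."""
    budget = throughput_kbps * safety_margin
    feasible = [b for b in BITRATES_KBPS if b <= budget]
    return feasible[-1] if feasible else BITRATES_KBPS[0]

def streaming_loop(num_segments, estimate_throughput, download_segment):
    """Request segments one by one, re-selecting the representation per segment."""
    for index in range(num_segments):
        bitrate = select_representation(estimate_throughput())
        download_segment(index, bitrate)
```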
In the past, we witnessed a plethora of research papers in this area (see the surveys in [4], [5]); however, we still lack a comprehensive evaluation framework for HAS systems in terms of both objective – i.e., Quality of Service (QoS) – and subjective – i.e., Quality of Experience (QoE) – metrics. Initial evaluations have been based on simple traffic shaping and network emulation tools [6] or on means to rapidly prototype adaptation algorithms [7]. Recently, we have seen various evaluation frameworks in this domain focusing on adaptation algorithms proposed both in academia and industry [8]–[10]. However, the main focus has been on QoS rather than QoE. The latter typically requires user studies, which are mainly conducted within controlled laboratory environments. Yet, nowadays crowdsourcing is also considered a reliable tool [11] and various platforms have been proposed [12] for this purpose.
In this paper, we propose a flexible and comprehensive framework to conduct objective and subjective evaluations of HAS systems in a fully automated and scalable way. It provides the following features:
End-to-end HAS evaluation of players deployed in industry and algorithms proposed in academia under various conditions and use cases (e.g., codecs/representations, network configurations, end user devices, player competition, etc.).
Collection and analysis of objective streaming performance metrics (e.g., startup time, stalls, quality switches, average bitrate).
Subjective quality assessment utilizing crowdsourcing for QoE evaluation of HAS systems and QoE model testing/verification (e.g., test or verify a proposed QoE model using subjective user studies).
The remainder of this paper is organized as follows. Section II comprises a detailed description of the architecture of the proposed framework. Section III presents example evaluation results to demonstrate the capabilities of the framework. A discussion and open research issues are provided in Section IV, and Section V concludes the paper.
II. FRAMEWORK ARCHITECTURE
Our framework – as depicted in Figure 1 – supports both objective and subjective evaluation of HAS systems and is composed of AdViSE [13] and WESP [14] plus extensions.
AdViSE [13] is an adaptive video streaming evaluation framework for the automated testing of web-based media players and adaptation algorithms. It has been designed in an extensible way to support (i) different adaptive media content formats (e.g., DASH, HLS, CMAF), (ii) commercially deployed media players as well as implementations of adaptation algorithms proposed in the research literature, and (iii) various networking parameters (e.g., bandwidth, delay) through network emulation. The output of AdViSE comprises a set of QoS and (objective) QoE metrics gathered and calculated during the adaptive streaming evaluation as well as a log of segment requests, which are used to generate the impaired media sequences used for the subjective evaluation.
Figure 1: General framework architecture: Adaptive Video Streaming Evaluation framework for the automated testing of web-based media players and adaptation algorithms (AdViSE) & WEb-based Subjective evaluation Platform (WESP).
The subjective evaluation is based on WESP [14], which is a web-based subjective evaluation platform using existing crowdsourcing platforms for subject recruitment and implementing best practices according to [11]. WESP takes the impaired media sequences as an input and allows for a flexible configuration of various QoE evaluation parameters such as (i) typical questionnaire assets (e.g., drop-down menus, radio buttons, free text fields), (ii) the subjective quality assessment methodology based on ITU recommendations (e.g., absolute category rating), and (iii) different crowdsourcing platforms (e.g., Microworkers, Mechanical Turk). The output of WESP comprises the subjective results, e.g., mean opinion scores (MOS), and any other data gathered during the subjective quality assessment. Together with the outputs of AdViSE, it is used to generate fully automated reports and data export functions, which are eventually used for further analysis.
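To give an impression of the kind of configuration WESP exposes, the following sketch models an SQA setup as plain data; the field names and values are our own illustration under stated assumptions and do not reflect WESP's actual configuration syntax.

```python
# Illustrative SQA configuration (not WESP's actual syntax or field names).
sqa_config = {
    "method": "ACR",                       # absolute category rating (single stimulus)
    "rating_scale": [1, 2, 3, 4, 5],       # from "bad" to "excellent"
    "crowdsourcing_platform": "Microworkers",
    "stimuli": ["playerA_run1.mp4", "playerB_run1.mp4"],  # impaired sequences from AdViSE
    "questionnaire": {
        "pre": ["age_group", "viewing_device"],
        "control_questions": ["Which object appeared in the last clip?"],
    },
    "tracking": ["window_focus", "per_stimulus_viewing_time", "voting_time"],
}
```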
In the following we provide a detailed description of AdViSE and WESP focusing on how they connect with each other leading to a fully automated objective and subjective evaluation of HAS systems. Further details about the individual building blocks can be found in [8], [9], [13], [14].
A. AdViSE: Adaptive Video Streaming Evaluation
AdViSE includes the following components as depicted in Figure 2:
Web server with standard HTTP hosting the media content and a MySQL database.
Figure 2: Architecture of adaptive video streaming evaluation framework for the automated testing of media players and adaptation algorithms.
Network emulation server with a customized Mininet environment for, e.g., bandwidth shaping (a minimal emulation sketch is shown after this list).
Selenium servers for running adaptive media players/algorithms on various platforms. Note that there might be multiple physical servers, each hosting a limited set of players/algorithms.
Web management interface for (i) conducting the experiments and (ii) running the adaptive media players.
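To illustrate how the network emulation and Selenium components interact, the following hedged sketch shapes a bottleneck link with Mininet and drives a web-based player through a remote Selenium session. Host names, the Selenium hub URL, the player page, and the shaping values are placeholders; AdViSE's actual orchestration (and the plumbing needed to route browser traffic through the emulated link) may differ.

```python
# Sketch only: emulate a bottleneck link with Mininet and drive a web player via Selenium.
# Hostnames, URLs, and shaping values are placeholders, not AdViSE's actual setup.
from mininet.net import Mininet
from mininet.link import TCLink
from selenium import webdriver

def run_experiment():
    net = Mininet(link=TCLink)
    client = net.addHost("client")
    server = net.addHost("server")
    # Shape the bottleneck link: 2.5 Mbps, 70 ms delay, no loss.
    net.addLink(client, server, bw=2.5, delay="70ms", loss=0)
    net.start()

    # Attach to a Selenium server hosting the player under test (placeholder URLs).
    driver = webdriver.Remote(
        command_executor="http://selenium-host:4444/wd/hub",
        options=webdriver.ChromeOptions(),
    )
    driver.get("http://server/player.html?mpd=bbb.mpd")
    # ... poll the player API for metrics during playback, then tear down ...
    driver.quit()
    net.stop()
```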
AdViSE defines a flexible system that allows adding new adaptive media players/algorithms relatively fast. The web management interface provides two functions: (i) configuring and conducting the experiments, and (ii) embedding the actual player/algorithm to provide real-time information about the currently running experiment. Thus, the framework proposed in this paper provides the means for a comprehensive end-to-end evaluation of adaptive streaming services over HTTP, including the possibility of subjective quality testing. The interface allows defining the following items and parameters: (1) configuration of network emulation profiles including the bandwidth trajectory, packet loss, and packet delay; (2) specification of the number of runs of an experiment; and (3) selection of one or more adaptive HTML5 players (or adaptation algorithms) and the adaptive streaming format used (e.g., DASH, HLS, CMAF).
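A hypothetical experiment definition combining these three items could look as follows; the field names are ours, chosen for illustration, and do not reflect AdViSE's web interface.

```python
# Illustrative experiment definition (field names do not reflect AdViSE's interface).
experiment = {
    "format": "DASH",
    "players": ["dash.js", "Bitmovin", "FESTIVE"],
    "runs": 5,
    "network_profile": {
        # (bandwidth in Kbps, duration in seconds); cf. the trajectory used in Section III
        "bandwidth_trajectory": [(750, 65), (350, 90), (2500, 120), (500, 90)],
        "delay_ms": 70,
        "packet_loss_pct": 0.0,
    },
}
```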
The result page provides a list of conducted experiments and the analytics section contains various metrics of the conducted experiments. It is possible to generate graphs for the results by using Highcharts and to export the raw values for further offline analysis. The following quality parameters and metrics are currently available: (i) startup time; (ii) stalls (or buffer underruns); (iii) number of quality switches; (iv) download bitrate; (v) buffer length; (vi) average bitrate; (vii) instability and inefficiency; and (viii) simple QoE models specially designed for HAS. Further metrics can be easily added based on what the application programming interfaces (APIs) of players actually offer, and as new metrics or QoE models become available.
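Some of these metrics can be derived directly from the per-segment log. The sketch below shows one straightforward way to obtain the number of quality switches and the average bitrate; it is a simplified illustration, and the exact definitions used by AdViSE (e.g., for instability and inefficiency) follow the respective literature and may differ.

```python
# Sketch: derive simple metrics from a per-segment bitrate log (values in Kbps).
def quality_switches(bitrate_log):
    """Count segment-to-segment representation changes."""
    return sum(1 for prev, cur in zip(bitrate_log, bitrate_log[1:]) if prev != cur)

def average_bitrate(bitrate_log):
    """Mean bitrate over all downloaded segments."""
    return sum(bitrate_log) / len(bitrate_log) if bitrate_log else 0.0

# Example: a player that ramps up, drops once, and recovers.
log = [800, 1800, 2400, 1200, 2400, 2400]
print(quality_switches(log), average_bitrate(log))  # 4 switches, ~1833 Kbps on average
```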
Finally, AdViSE provides the log of the segment requests, which are used – together with metrics such as startup time and stalls – to generate a media sequence as received by the player, and consequently, perceived by the user. The request log is used to concatenate the segments according to the request schedule of the player, thus, reflecting the media bitrate and quality switches. Other impairments such as startup time or stalls are automatically inserted based on the corresponding metrics gathered during the evaluation and by using predefined templates (e.g., stalls displayed as spinning wheel). This impaired media sequence is used in the subsequent step for the subjective QoE evaluation using WESP, which could also include the unimpaired media presentation depending on the employed evaluation method.
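Conceptually, this is a two-step process: concatenate the segment renditions in the order they were requested and splice in templates for startup delay and stalls. The hedged sketch below uses ffmpeg's concat demuxer as one possible realization; file names, the request-log structure, and the spinner template are placeholders, and AdViSE's actual implementation may differ.

```python
# Sketch: build an impaired sequence from a segment request log using ffmpeg.
# File names, log structure, and the stall/spinner template are placeholders;
# "-c copy" further assumes all inputs share identical codec parameters.
import subprocess

def build_impaired_sequence(request_log, stall_template="spinner_2s.mp4", out="impaired.mp4"):
    """request_log: list of dicts such as {"segment": "seg_12_1800kbps.mp4", "stall_s": 0}."""
    with open("concat.txt", "w") as f:
        for entry in request_log:
            f.write(f"file '{entry['segment']}'\n")
            if entry.get("stall_s", 0) > 0:
                f.write(f"file '{stall_template}'\n")  # stall shown as a spinning-wheel clip
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "concat.txt", "-c", "copy", out],
        check=True,
    )
```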
Summary: AdViSE provides scalable, end-to-end HAS evaluation through emulation with a plethora of configuration possibilities regarding content configuration, players/algorithms (including player competition scenarios), and network parameters. With AdViSE, it is possible to use actual content and network settings with actual dynamic, adaptive streaming including rendering. We collect various metrics from players based on their API (i.e., when access to the source code is restricted) or from the algorithms/HTML5 directly. Additionally, we implemented so-called derived metrics and utilize QoE models proposed in the literature. Finally, the segment request log is used to generate the impaired media sequence as perceived by end users for subjective quality testing.
Figure 3: Architecture of web-based subjective evaluation platform.
B. WESP: Web-Based Subjective Evaluation Platform
Subjective quality assessments (SQAs) are a vital tool for evaluating QoE. SQAs provide reliable results, but they are considered cost-intensive and are typically conducted within controlled laboratory environments. Crowdsourcing has been proposed as an alternative to reduce the cost; however, various aspects need to be considered in order to obtain similarly reliable results [11]. In the past, several frameworks have been proposed leveraging crowdsourcing platforms to conduct SQAs, each providing different features [12]. However, a common shortcoming of these frameworks is that they require tedious configuration and setup for each SQA, which makes them difficult to use. Therefore, we propose to use a web-based management platform, which shall (i) enable easy and simple configuration of SQAs including the possible integration of third-party tools for online surveys, (ii) provide means to conduct SQAs using existing crowdsourcing platforms taking into account best practices as discussed in [11], and (iii) allow for the analysis of the results.
The goal of WESP is not only to provide a framework that fulfills the recommendations of the ITU for subjective evaluations of multimedia applications (e.g., BT.500, P.910, P.911) but also to provide the possibility to select and configure the preferred evaluation method via a web interface. The conceptual WESP architecture, shown in Figure 3, is implemented using HTML/PHP with a MySQL database.
The introduction and questionnaires can be configured separately from the test methodology and may include control questions during the main evaluation. The voting possibility can be configured independently from the test methodology, providing more flexibility in selecting the appropriate voting mechanism and rating scale. The predefined voting mechanisms include the common HTML interface elements and some custom controls like a slider in different variations. The platform consists of a management layer and a presentation layer. The management layer allows for maintaining the user study, such as adding new questions or multimedia content and setting up the test method to be used (e.g., single stimulus, double stimulus, pair comparison, continuous quality evaluation, etc.). The presentation layer is responsible for presenting the content to the participants. This allows providing different views on the user study, and thus, one can define groups to which the participants may be randomly (or in a predefined way) assigned. After a participant finishes the user study, the gathered data is stored in a MySQL database. Furthermore, the platform offers methods for tracking the participant's behavior during an SQA (e.g., focus of the web browser's window/tab, time for consuming each stimulus presentation, time it takes for the voting phase, etc.) and data provided by the web player API.
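The tracked behavior and control questions can subsequently be used to filter out unreliable crowdsourcing submissions before computing MOS values. The following is a minimal sketch of such a post-hoc filter; the thresholds and field names are illustrative assumptions, not WESP's actual rules.

```python
# Sketch: filter crowdsourcing submissions using tracked behavior and control questions.
# Thresholds and field names are illustrative assumptions only.
def is_reliable(submission, min_viewing_ratio=0.9, max_focus_losses=2):
    """Accept a submission only if the worker watched the stimulus and passed the checks."""
    if not submission["control_questions_passed"]:
        return False
    if submission["focus_losses"] > max_focus_losses:
        return False
    viewed = submission["viewing_time_s"] / submission["stimulus_duration_s"]
    return viewed >= min_viewing_ratio

def mean_opinion_score(submissions):
    """MOS over reliable submissions, or None if none remain."""
    ratings = [s["rating"] for s in submissions if is_reliable(s)]
    return sum(ratings) / len(ratings) if ratings else None
```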
Figure 4: Download video bitrate for the selected industry players (left) and adaptation algorithms proposed in the research literature (right).
The stimuli presentation can be configured independently from the test method and may be combined with the voting possibility to support continuous quality evaluations. The media content can be fully downloaded and cached on the evaluation device prior to starting the actual media presentation to avoid glitches during the evaluation, e.g., due to network issues. However, the platform also supports streaming evaluations in real-world environments where various metrics (e.g., startup time, stalls) are collected and stored for analysis.
Summary: WESP provides an extensible, web-based QoE evaluation platform utilizing crowdsourcing. It supports a plethora of evaluation methodologies and configuration possibilities. Although it has been specifically designed to implement SQAs for HAS systems using crowdsourcing (including support for real-world environments), it can also be used for SQAs within laboratory environments.
III. EXAMPLE EVALUATION RESULTS
In this section, we provide example evaluation results of selected industry players and adaptation algorithms proposed in the research literature: Bitmovin v7.0, dash.js v2.4.0, Flowplayer v6.0.5, FESTIVE [15], Instant [16], and Thang [17]. Note that we show only a small selection – also due to page limits – and the results presented here should only be seen as an example of what the framework provides rather than a full-fledged player comparison sheet. Further results obtained using the tools described in this paper can be found in [8], [9], [18].
For the evaluation, we used the Big Buck Bunny sequence and encoded it according to the Amazon Prime video service, which offers 15 different representations as follows: 400x224 (100 Kbps), 400x224 (150), 512x288 (200), 512x288 (300), 512x288 (500), 640x360 (800), 704x396 (1200), 704x396 (1800), 720x404 (2400), 720x404 (2500), 960x540 (2995), 1280x720 (3000), 1280x720 (4500), 1920x1080 (8000), and 1920x1080 (15000). The segment length is 4 s and one audio representation at 128 Kbps is used. We adopt the bandwidth trajectory from [6], providing both step-wise and abrupt changes in the available bandwidth, i.e., 750 Kbps (65 seconds), 350 Kbps (90), 2500 Kbps (120), 500 Kbps (90), 700 Kbps (30), 1500 Kbps (30), 2500 Kbps (30), 3500 Kbps (30), 2000 Kbps (30), 1000 Kbps (30), and 500 Kbps (85). The network delay is set to 70 milliseconds.
Figure 4 shows the download bitrate for the players and algorithms in question, and Table I provides an overview of all metrics. Metrics a.–e. are directly retrieved from the player/HTML5 API and the algorithm implementations, respectively. Metrics f.–g. utilize simple QoE models [19], [20] to calculate MOS values ranging from one to five based on a subset of the other metrics. Interestingly, industry players and research algorithms show different performance behavior under the same conditions but can be directly compared with each other. A detailed analysis of the results is out of the scope of this paper.
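For illustration, the sketch below shows a generic model of this family: a bitrate-driven base score penalized by startup delay, stalls, and quality switches, clipped to the MOS range. The coefficients are arbitrary placeholders and do not reproduce the models from [19], [20].

```python
# Sketch of a generic HAS QoE model: base quality from relative bitrate,
# penalties for startup delay, stalls, and switches. Coefficients are placeholders,
# NOT the models from [19], [20].
def simple_qoe_mos(avg_bitrate_kbps, max_bitrate_kbps, startup_s,
                   num_stalls, total_stall_s, num_switches):
    base = 1.0 + 4.0 * (avg_bitrate_kbps / max_bitrate_kbps)  # maps to the 1..5 MOS range
    penalty = 0.05 * startup_s + 0.5 * num_stalls + 0.1 * total_stall_s + 0.02 * num_switches
    return max(1.0, min(5.0, base - penalty))

# Example: 4500 Kbps average (max 8000), 1 s startup, no stalls, 5 switches -> 3.1
print(round(simple_qoe_mos(4500, 8000, 1.0, 0, 0.0, 5), 2))
```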
IV. DISCUSSION AND CHALLENGES
In this section, we provide a discussion of our framework for the automated objective and subjective evaluation of HAS systems. It allows for an easy setup of various configurations and for running multiple evaluations in parallel. New players and algorithms can be added easily as they appear on the market and in the research literature. Over time, it is possible to build up a repository of players and algorithms for comprehensive performance evaluation. As it is possible to run multiple Selenium servers in parallel, our framework is capable of evaluating scenarios in which players/algorithms compete for bandwidth in various configurations (e.g., n instances of player A vs. m instances of player B).
The framework is quite flexible, and thus, comes with a high number of degrees of freedom. Hence, it is important to design the evaluation carefully. Here we provide a brief list of the aspects to consider: (A) Content assets: content type, codec/coding parameters (including HDR, WCG), representations (bitrate/resolution pairs, also referred to as the bitrate ladder), segment length (including GOP size), representation format (i.e., DASH, HLS, CMAF), etc.; (B) Network parameters: bandwidth trajectory (i.e., predefined, network traces), delay, loss, and other networking aspects (see below for further details); (C) End user device environment: device type, operating system, browser, etc.; (D) Streaming performance metrics: average bitrate, startup time, stalls (frequency, duration), quality switches (frequency, amplitude), etc.; (E) Quantitative QoE models based on audio-video quality and/or streaming performance metrics; (F) General HAS evaluation setup: live vs. on-demand content, single player vs. multiple players competing for bandwidth, etc.; (G) Templates for generating the impaired media sequence (i.e., how to realize startup delay and stalls); (H) Questionnaire for the SQA including control questions for crowdsourcing; (I) SQA method (e.g., single stimulus, double stimulus, pair-wise comparison) and its parametrization; and (J) Collection of all results and further (offline) analysis.
All these aspects are important to consider and a potential source of risk when conducting such experiments. A detailed analysis is beyond the scope of this paper.
Based on our experience from conducting multiple evaluations and performance comparisons, we identified the following research challenges, possibly subject to future work: (I) The reliability of results requires cross-validation, specifically for those from SQAs, which typically calls for SQAs in controlled laboratory environments. (II) The network is a key aspect within HAS systems but is often neglected. Network emulation is a vital tool but has limitations. For HAS systems, we also need to consider content distribution networks (CDNs), software-defined networking (SDN), information-centric networking (ICN), and next-generation (mobile) networks (e.g., 5G). Detailed analyses and evaluations of these aspects in the context of HAS are currently missing. However, for example, recent standardization and research contributions showed benefits for HAS systems when combined with SDN [21]. (III) Reproducibility of such a framework can be achieved by providing containerized versions of the modules as done in [10]. This is considered critical when using industry players, which often require licenses. Additionally, it could be interesting to connect to large-scale research networks (such as PlanetLab, the Virtual Internet Routing Lab, or GENI).
V. CONCLUSIONS
This paper describes how AdViSE and WESP can be combined to perform objective and subjective evaluations of HAS systems in a fully automated and scalable way. For example, the framework can be used to test and compare new players/algorithms under various context conditions or to research new QoE models with practically instant verification through subjective tests. The main finding of this work is that a comprehensive objective and subjective evaluation of HAS systems is feasible for both industry players and adaptation algorithms proposed in the research literature. Hence, we recommend adopting it when proposing new features in this area and evaluating these features against the state of the art.