- Tencent: Video Technologies Towards Human Perception
- Microsoft: Machine Learning for Computer Vision Applications
- Google: Advances in Understanding and Processing User-generated Video
- Qualcomm: Depth Sensing for Mobile Phones, Industrial Automation and Home Security
- Facebook: Video Processing at Facebook - How to Increase Quality and Power Efficiency at Scale
Title: Video Technologies Towards Human Perception
Date: Sept. 23
With recent advances in deep neural networks, video technologies have taken a big leap toward serving human perception in a more refined manner. This includes artifact reduction, saliency detection/protection, and no-reference quality assessment. In this workshop, we will share our experiences building the human perceptual video service Kanasky (a.k.a. Tencent Liying).
First, we will discuss CNN-based video processing and enhancement techniques that boost human perception. With user-generated content (UGC) being the dominant video source on the internet, these methods improve the viewing experience as measured by mean opinion score (MOS) statistics. In the second session, we will introduce a no-reference quality assessment method that can be used to evaluate various video processing algorithms. To address the shortage of labeled data and the fixed-size input constraint of CNN models, the proposed method is built on rank learning and a fully convolutional network with effective feature pooling. Lastly, the assessment effort is extended to video, where quality exhibits different characteristics from static image quality due to temporal masking effects. An architecture using a convolutional neural network with 3D kernels (C3D) is adopted and demonstrates better performance than state-of-the-art methods.
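To make the rank-learning idea concrete, here is a minimal sketch (illustrative only, not TML's actual model): given a pair of clips where one is known to look better than the other, a margin ranking loss pushes the predicted score of the better clip above that of the worse one, so ordered pairs can substitute for scarce absolute quality labels. The feature values and the mean-pooling predictor below are hypothetical stand-ins.

```python
import numpy as np

def pairwise_rank_loss(score_a, score_b, margin=1.0):
    # Margin ranking loss: zero once score_a exceeds score_b by `margin`,
    # otherwise penalize the violation of the known quality ordering.
    return max(0.0, margin - (score_a - score_b))

def predict_quality(features):
    # Toy predictor standing in for a fully convolutional network:
    # global average pooling over frame-level feature maps, which is
    # what lifts the fixed-size input constraint mentioned above.
    return float(np.mean(features))

good = np.array([0.90, 0.80, 0.95])  # hypothetical features, higher-quality clip
bad = np.array([0.20, 0.30, 0.25])   # hypothetical features, degraded clip

loss = pairwise_rank_loss(predict_quality(good), predict_quality(bad))
```

A trainable model would backpropagate this loss through the network; here the point is only that the supervision signal is the pair ordering, not an absolute score.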
Tencent Media Lab (TML) is dedicated to cutting-edge research on a broad spectrum of multimedia technologies, ranging from high-quality on-demand video services, webcasting, and real-time audio/video communications to multimedia standardization. Having served billions of Tencent customers for over a decade, TML is recognized as a leader and pioneer in the multimedia industry, with fruitful research contributions and award-winning innovations.
Meng-Ping Kao is currently a principal researcher at Tencent Media Lab, leading the Kanasky project (a.k.a. Tencent Liying). He received his Ph.D. from the ECE Dept., University of California, San Diego in 2008. His research interests include video technologies, human perception, machine learning, and computer vision.
Yabin Zhang is currently a senior researcher at Tencent Media Lab. He received the B.E. degree in Electronic Information Engineering from the Honors School of Harbin Institute of Technology and the Ph.D. degree from the School of Computer Science and Engineering, Nanyang Technological University, Singapore, in 2013 and 2018, respectively. His research interests include video coding, image/video processing, image/video quality assessment, and computer vision.
Haiqiang Wang received his Ph.D. from the University of Southern California in 2018. He is a member of the Media Lab at Tencent. His research interests include video quality assessment (VQA) on both UGC and PGC videos, video coding, and machine learning.
Title: Machine Learning for Computer Vision Applications
Date: Sept. 23
This workshop consists of three talks. The topics are 3D skeletal tracking on Azure Kinect, Optical Character Recognition (OCR) and its applications, and towards practical solutions for 3D face tracking and reconstruction. The contents for these three talks are as follows:
Microsoft has built a new RGB-D sensor called Azure Kinect and released the Azure Kinect DK, a developer kit and PC peripheral for computer vision and speech recognition models (https://azure.microsoft.com/en-us/services/kinect-dk/). In this talk, we will briefly introduce the hardware and describe in more detail the Azure Kinect Body Tracking SDK, a neural-network-based solution for 3D skeletal tracking with the new RGB-D sensor.
OCR is an image processing task that has been studied for decades. Thanks to recent developments in deep learning, OCR has undergone major algorithmic redesigns that achieve much better accuracy. In this talk, we will give a brief overview of our efforts to build a state-of-the-art OCR engine and show how to apply it in enterprise applications.
3D face tracking and reconstruction both have important applications. Although academic progress has been significant in recent years thanks to advances in deep learning, building robust and practical solutions remains very challenging in many scenarios. In this talk, we will present a few of our ongoing research works and share how we address those challenges. Specifically, we will describe how to develop an end-to-end RGB-based 3D face tracker that runs in real time on mobile devices, and then share the latest progress on a scalable 3D face reconstruction system.
Zicheng Liu is currently a principal research manager at Microsoft. His current research interests include human pose estimation and activity understanding. He received a Ph.D. in Computer Science from Princeton University, an M.S. in Operational Research from the Institute of Applied Mathematics, Chinese Academy of Sciences, and a B.S. in Mathematics from Huazhong Normal University, China. Before joining Microsoft Research in 1997, he worked at Silicon Graphics as a member of technical staff and shipped the OpenGL NURBS tessellator and OpenGL Optimizer. He has co-authored three books: “Face Geometry and Appearance Modeling: Concepts and Applications” (Cambridge University Press), “Human Action Recognition with Depth Cameras” (Springer Briefs), and “Human Action Analysis with Randomized Trees” (Springer Briefs). He was a technical co-chair of the 2010 and 2014 IEEE International Conferences on Multimedia and Expo, and a general co-chair of the 2012 IEEE Visual Communications and Image Processing conference. He is the Editor-in-Chief of the Journal of Visual Communication and Image Representation. He served as a Steering Committee member of the IEEE Transactions on Multimedia. He was a distinguished lecturer of IEEE CAS from 2015 to 2016, and chair of the IEEE CAS Multimedia Systems and Applications technical committee from 2015 to 2017. He is a fellow of the IEEE.
Cha Zhang is a principal engineering manager at Microsoft Cloud & AI working on computer vision. He received the B.S. and M.S. degrees from Tsinghua University, Beijing, China in 1998 and 2000, respectively, both in Electronic Engineering, and the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University in 2004. After graduation, he worked at Microsoft Research for 12 years investigating research topics including multimedia signal processing, computer vision, and machine learning. He has published more than 100 technical papers and holds more than 30 U.S. patents. He served as Program Co-Chair for VCIP 2012 and MMSP 2018, and General Co-Chair for ICME 2016. He is a Fellow of the IEEE. Since joining Cloud & AI, he has led teams to ship industry-leading technologies in Microsoft Cognitive Services such as emotion recognition and optical character recognition.
Baoyuan Wang is currently a principal research manager on the Microsoft Cognition vision team in Redmond, US. His research interests include automatic 3D content creation, computational photography, and deep learning applications. He has shipped several key technologies to various Microsoft products including Bing Maps, Xbox/Kinect, Microsoft Pix camera, and SwiftKey. Dr. Wang received his bachelor's degree and his Ph.D. from Zhejiang University in 2007 and 2012, respectively.
Title: Advances in Understanding and Processing User-generated Video
Date: Sept. 24
The workshop is split into three sections. The first section focuses on datasets and analysis for user-generated content (UGC). Most videos uploaded to YouTube are generated by non-professional creators with consumer devices. Traditional video quality metrics used in compression and quality assessment, like BD-Rate and PSNR, are designed for pristine originals, so their accuracy and utility drop when applied to non-pristine originals, i.e., the majority of UGC. Here, we will address challenges in compression and quality assessment of such content and introduce our recent work on the YouTube UGC dataset.
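A small sketch makes the pristine-original assumption concrete. PSNR compares a distorted signal against a reference assumed to be clean; when the uploaded "original" already carries noise, as in typical UGC, the metric rewards preserving that noise rather than perceived quality. The image sizes and noise model below are illustrative choices, not part of the YouTube UGC work.

```python
import numpy as np

def psnr(reference, distorted, max_val=255.0):
    # Full-reference PSNR in dB; the formula presumes `reference` is pristine.
    diff = reference.astype(np.float64) - distorted.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical signals: PSNR is unbounded
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
pristine = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
# Simulate a typical UGC "original": the upload itself is already noisy.
noisy_upload = np.clip(
    pristine.astype(np.int32) + rng.integers(-20, 21, size=(64, 64)), 0, 255
).astype(np.uint8)

score = psnr(pristine, noisy_upload)  # a finite, mediocre score
```

Measured against `noisy_upload` instead of `pristine`, a compressor that smooths the noise away would score worse even if viewers prefer the result, which is the core mismatch the section above addresses.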
The third section talks about MediaPipe, a graph-based framework for building multi-modal ML perception pipelines. It is developed at Google, widely adopted in Google research and products, and now open source and available to all ML researchers and practitioners. With MediaPipe, a perception pipeline can be built as a graph of modular components, including, for instance, inference models and media processing functions. Sensory data such as video streams enter the graph, and perceived descriptions such as object-localization and face-landmark streams exit the graph. In this section, an overview of MediaPipe will be presented together with use-case examples enabling real-time perception in the camera viewfinder on mobile devices.
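The graph-of-modular-components idea can be illustrated with a toy pipeline. This is a conceptual sketch in Python, not the real MediaPipe API (which is C++ based with graph configuration files); the stage names and lambda "calculators" are invented for illustration.

```python
# Toy graph-based perception pipeline: each node is a callable "calculator"
# that consumes one named input stream and produces one named output stream.
class Graph:
    def __init__(self):
        # Nodes are stored (and must be added) in topological order.
        self.nodes = []  # list of (input_stream, output_stream, fn)

    def add_node(self, input_stream, output_stream, fn):
        self.nodes.append((input_stream, output_stream, fn))

    def run(self, **input_streams):
        # Sensory data enters as named streams; each node's output becomes
        # available to downstream nodes under its output-stream name.
        streams = dict(input_streams)
        for inp, out, fn in self.nodes:
            streams[out] = fn(streams[inp])
        return streams

g = Graph()
g.add_node("video", "frames", lambda v: v.split("|"))                    # decode stage
g.add_node("frames", "detections", lambda fs: [f.upper() for f in fs])   # inference stage

result = g.run(video="frame1|frame2")  # "detections" holds the final stream
```

The modularity shown here is the point: swapping the inference stage for a different model changes one node, not the pipeline.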
Sasi Inguva received his bachelor's degree in Computer Science from IIT Madras in 2012, with a special focus on computer vision. After graduation, he joined the Media Algorithms team at YouTube/Google. His research fields include video processing infrastructure, 3D reconstruction from videos, and video quality assessment.
Yilin Wang received his PhD from the University of North Carolina at Chapel Hill in 2014, working on topics in computer vision and image processing. After graduation, he joined the Media Algorithms team at YouTube/Google. His research fields include video processing infrastructure, video quality assessment, and video compression.
Balu Adsumilli currently manages and leads the Media Algorithms group at YouTube/Google. He received his master's degree from the University of Wisconsin–Madison in 2002 and his PhD from the University of California, Santa Barbara in 2005, on watermark-based error resilience in video communications. From 2005 to 2011, he was a Sr. Research Scientist at Citrix Online, and from 2011 to 2016 he was Sr. Manager of Advanced Technology at GoPro; at both places he developed algorithms for image/video quality enhancement, compression, capture, and streaming. He is an active member of IEEE (and the MMSP TC), ACM, SPIE, and VES, and has co-authored more than 120 papers and patents. His fields of research include image/video processing, machine vision, video compression, spherical capture, VR/AR, visual effects, and related areas.
Dr. Debargha Mukherjee received his M.S./Ph.D. degrees in ECE from the University of California, Santa Barbara in 1999. Thereafter, through 2009, he was with Hewlett-Packard Laboratories, conducting research on video/image coding and processing. Since 2010 he has been with Google Inc., where he is currently a Principal Engineer involved with open-source video codec research and development, notably VP9 and AV1. Prior to that he was responsible for video quality control and 2D-to-3D conversion on YouTube. Debargha has authored or co-authored more than 100 papers on various signal processing topics and holds more than 60 US patents, with many more pending. He has delivered many workshops and talks on Google's royalty-free line of codecs since 2012, and more recently on the AV1 video codec from the Alliance for Open Media (AOM). He has served as Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology and the IEEE Transactions on Image Processing. He is also a member of the IEEE Image, Video, and Multidimensional Signal Processing Technical Committee (IVMSP TC).
Chuo-Ling Chang is currently a Technical Lead Manager at Google Research. Prior to joining Google in 2014, he worked at multiple startup companies leading research and development of multimedia coding, processing and interactive streaming systems. He received his Ph.D. degree from Information Systems Laboratory at Stanford University, CA.
Title: Depth Sensing for Mobile Phones, Industrial Automation and Home Security
Date: Sept. 24
This session introduces SLiM, Himax's implementation of the Qualcomm depth sensing reference design. The workshop will center around a demo of the SLiM depth sensing module and its capabilities. The module's performance and usage will be discussed, along with several reference applications implemented using the provided SDK.
Champ Yen is an Application Engineer at Qualcomm Taiwan Corporation. He received a B.S. degree in computer science and information engineering from National Cheng Kung University in 2001, and an M.S. degree in computer science and information engineering from National Chiao Tung University, Taiwan, in 2003. He provides support for customer application development, including algorithm porting, problem solving, and technical support. Champ has significant experience in GPGPU, DSP, and domain-specific programming. In recent years he has focused on the optimization and development of camera and computer vision applications.
Eric is the Sales and Marketing AVP of the SLiM and AI-based CV product line at Himax Technologies, Inc. (NASDAQ: HIMX), a Taiwan-based fabless semiconductor solution provider dedicated to display imaging processing technologies. Eric has been in the semiconductor business for over 30 years. He joined Himax over 8 years ago and has participated in Himax's image processing, TCON, optical and nanoimprinting, structured light projector, 3D structured light sensing (SLiM), and AI-based computer vision solutions in ASIC and SoC systems, with successful design wins at world tier-1 cell phone, notebook, TV, and projector companies.
Title: Video Processing at Facebook - How to Increase Quality and Power Efficiency at Scale
Date: Sept. 25
Facebook is the world's largest social network, offering a variety of products that support video, such as Facebook Live, Facebook Watch, Instagram TV (IGTV), Messenger and WhatsApp video calling, and Oculus/Portal hardware that enables user immersion. We handle both premium and user-generated content at varying source qualities and make it available all over the world over highly variable network conditions. We use adaptive bitrate streaming to maximize quality, and end-to-end encryption to protect our members' privacy. Video processing takes place in our own datacenters, where our focus is on the highest levels of security, availability, quality, and energy efficiency. In our session we will cover topics such as how we measure video quality at scale, what we do to maximize that quality, and what steps we take to reduce the energy requirements of all video processing in our datacenters. We will highlight some of our research initiatives in this space and include a panel discussion with world experts on the challenges and possible research directions in efficient video processing.
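Adaptive bitrate streaming, mentioned above, boils down to encoding each video at several bitrates and letting the client pick a rendition that fits its current network conditions. The sketch below shows a minimal bandwidth-based selection heuristic; the bitrate ladder values and the safety factor are hypothetical, not Facebook's actual configuration.

```python
def pick_rendition(ladder_kbps, measured_bandwidth_kbps, safety=0.8):
    # Pick the highest-bitrate rendition that fits within a safety margin
    # of the measured bandwidth; fall back to the lowest rendition if
    # even that exceeds the budget. A deliberately minimal ABR heuristic.
    budget = measured_bandwidth_kbps * safety
    viable = [b for b in sorted(ladder_kbps) if b <= budget]
    return viable[-1] if viable else min(ladder_kbps)

ladder = [235, 560, 1050, 2350, 4300]  # hypothetical bitrate ladder in kbps
choice = pick_rendition(ladder, measured_bandwidth_kbps=3000)
```

Production ABR controllers also weigh buffer occupancy, bandwidth variance, and switching cost; the point here is only the ladder-plus-throughput decision at the core of the technique.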
Dr. Ioannis Katsavounidis
Dr. Ioannis Katsavounidis is a member of Video Fundamentals and Research, part of the Video Infrastructure team, leading technical efforts to improve video quality across all video products at Facebook. Before joining Facebook, he spent 3.5 years at Netflix, contributing to the development and popularization of VMAF, Netflix's video quality metric, and inventing the Dynamic Optimizer, a shot-based video quality optimization framework that brought significant bitrate savings across the whole streaming spectrum. Before that, he was a professor for 8 years in the Electrical Engineering Department of the University of Thessaly in Greece, teaching video compression, signal processing, and information theory. He has over 100 publications and patents in the general field of video coding, as well as in high-energy experimental physics. His research interests lie in video coding, video quality, adaptive streaming, and hardware/software partitioning of multimedia processing.
Dr. Mani Malek Esmaeili
Mani Malek Esmaeili received his PhD from the University of British Columbia. His research interests are multimedia retrieval, computer vision, and the general problem of approximate nearest neighbor search. He works in Facebook's video infrastructure group as an algorithm developer and has led algorithm development for the Media Copyright team for the past year.
Dr. Nick Wu
Nick Wu is a member of Video Fundamentals and Research at Facebook working on building key video infrastructure pieces to serve video@scale. He received his Ph.D. in Electrical Engineering from the University of Southern California in 2010. After that, he worked on video content analysis and its application to indexing, ranking, and recommendations in a video search engine. He then spent several years in the cloud gaming space, working on interactive video streaming, graphics, and virtualization. Prior to joining Facebook, he was at Google working on Google App Streaming and the Android virtual device. His research interests lie in video coding, adaptive streaming, and video content analysis.
Shankar Regunathan is a member of Video Algorithms at Facebook working on video quality measurement and encoding improvements, with a particular focus on user-generated content. Prior to joining Facebook, he spent several years at Microsoft working on VC-1, JPEG XR, and contributions to AVC/SVC. He received a Ph.D. in EE from the University of California, Santa Barbara. He received the IEEE Signal Processing Society Best Paper Award in 2004 and 2007. His research interests lie at the intersection of video compression, signal processing, and coding theory.