Academic Press Library in Signal Processing, Volume 6: Image and Video Processing and Analysis and Computer Vision
About this ebook
Academic Press Library in Signal Processing, Volume 6: Image and Video Processing and Analysis and Computer Vision is aimed at university researchers, postgraduate students, and R&D engineers in industry, providing a tutorial-based, comprehensive review of key research topics and technologies in image and video processing and analysis and computer vision. The book provides an invaluable starting point to the area through the insight and understanding it offers.
With this reference, readers will quickly grasp an unfamiliar area of research, understand the underlying principles of a topic, learn how a topic relates to other areas, and learn of research issues yet to be resolved.
- Presents quick tutorial reviews of important and emerging topics of research
- Explores core principles, technologies, algorithms and applications
- Edited and contributed to by leading international figures in the field
- Includes comprehensive references to journal articles and other literature upon which to build further, more detailed knowledge
Academic Press Library in Signal Processing, Volume 6
Image and Video Processing and Analysis and Computer Vision
First Edition
Rama Chellappa
Department of Electrical and Computer Engineering and Center for Automation Research, University of Maryland, College Park, MD, USA
Sergios Theodoridis
Department of Informatics & Telecommunications, University of Athens, Greece
Table of Contents
Cover image
Title page
Copyright
Contributors
About the Editors
Section Editors
Introduction
Section 1: Visual Information Processing
Chapter 1: Multiview video: Acquisition, processing, compression, and virtual view rendering
Abstract
Acknowledgments
1.1 Multiview Video
1.2 Multiview Video Acquisition
1.3 Multiview Video Preprocessing
1.4 Depth Estimation
1.5 View Synthesis and Virtual Navigation
1.6 Compression
1.7 Future Trends
Glossary
Chapter 2: Plenoptic imaging: Representation and processing
Abstract
Acknowledgments
2.1 Introduction
2.2 Light Representation: The Plenoptic Function Paradigm
2.3 Empowering the Plenoptic Function: Example Use Cases
2.4 Plenoptic Acquisition and Representation Models
2.5 Plenoptic Data Coding
2.6 Plenoptic Data Rendering
2.7 Plenoptic Representations Relationships
2.8 Related Standardization Initiatives
2.9 Future Trends and Challenges
Glossary
Chapter 3: Visual attention, visual salience, and perceived interest in multimedia applications
Abstract
3.1 Introduction
3.2 Classification of Attention Mechanisms
3.3 Computational Models of Visual Attention
3.4 Acquiring Ground Truth Visual Attention Data for Model Verification
3.5 Applications of Visual Attention
Chapter 4: Emerging science of QoE in multimedia applications: Concepts, experimental guidelines, and validation of models
Abstract
4.1 QoE Definition and Influencing Factors
4.2 QoE Measurement
4.3 Performance Evaluation of Objective QoE Estimators
4.4 Conclusion
Section 2: Computational Imaging and 3D Analysis
Chapter 5: Computational photography
Abstract
5.1 Introduction
5.2 Breaking Precepts Underlying Photography
5.3 Cameras With Novel Form Factors and Capabilities
5.4 Solving Inverse Problems
5.5 Conclusions
Chapter 6: Face detection with a 3D model
Abstract
6.1 Introduction
6.2 Face Detection Using a 3D Model
6.3 Parameter Sensitive Model
6.4 Fitting 3D Models
6.5 Experiments
6.6 Conclusions and Future Trends
Chapter 7: A survey on nonrigid 3D shape analysis
Abstract
Acknowledgments
7.1 Introduction
7.2 General Formulation
7.3 Shape Spaces and Metrics
7.4 Registration and Geodesics
7.5 Statistical Analysis Under Elastic Metrics
7.6 Examples and Applications
7.7 Summary and Perspectives
Chapter 8: Markov models and MCMC algorithms in image processing
Abstract
8.1 Introduction: The Probabilistic Approach in Image Analysis
8.2 Lattice-based Models and the Bayesian Paradigm
8.3 Some Inverse Problems
8.4 Spatial Point Processes
8.5 Multiple Objects Detection
8.6 Conclusion
Section 3: Image and Video-Based Analytics
Chapter 9: Scalable image informatics
Abstract
Acknowledgments
9.1 Introduction
9.2 Core Concepts
9.3 Basic micro-services
9.4 Analysis Modules
9.5 Building on the Concepts: Sparse Images
9.6 Feature Services and Machine Learning
9.7 Application Example: Annotation and Classification of Underwater Images
9.8 Summary
Chapter 10: Person re-identification
Abstract
10.1 Introduction
10.2 The re-identification Problem: Scenarios, Taxonomies, and Related Work
10.3 Experimental Evaluation of re-id Datasets and Their Characteristics
10.4 The SDALF Approach
10.5 Metric Learning
10.6 Conclusions and New Challenges
Chapter 11: Social network inference in videos
Abstract
11.1 Introduction
11.2 Related Work
11.3 Video Shot Segmentation
11.4 Actor Recognition
11.5 Learning to Group Actors
11.6 Inferring Social Communities
11.7 Social Network Analysis
11.8 Experiments
11.9 Latent Features
11.10 Summary
Index
Copyright
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1800, San Diego, CA 92101-4495, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
© 2018 Elsevier Ltd. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN 978-0-12-811889-4
For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Mara Conner
Acquisition Editor: Tim Pitts
Editorial Project Manager: Charlotte Kent
Production Project Manager: Sujatha Thirugnana Sambandam
Cover Designer: Mark Rogers
Typeset by SPi Global, India
Contributors
Adrian Barbu Statistics Department, Florida State University, Tallahassee, FL 32306
Patrick Le Callet University of Nantes, Nantes, France
Marco Cristani University of Verona, Verona, Italy
Eduardo A.B. da Silva Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
Xavier Descombes Université Côte d'Azur, INRIA, I3S, IBV, Sophia-Antipolis cedex, France
Marek Domański Poznań University of Technology, Poznań, Poland
Dmitry Fedorov University of California, Santa Barbara, CA, United States
Gary Gramajo New Ventures Analyst, Chick-fil-A Corporation, Atlanta, GA 30349
Ashish Gupta Ohio State University, Columbus, OH, United States
Lukáš Krasula University of Nantes, Nantes, France
Kristian Kvilekval University of California, Santa Barbara, CA, United States
Gauthier Lafruit Brussels University (French wing), Brussels, Belgium
Hamid Laga
Murdoch University, Perth, WA, Australia
Phenomics and Bioinformatics Research Centre, University of South Australia
Christian A. Lang University of California, Santa Barbara, CA, United States
Nathan Lay Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Clinical Center, National Institutes of Health, Bethesda, MD 20892
B.S. Manjunath University of California, Santa Barbara, CA, United States
Vittorio Murino Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Genova, Italy
Adithya K. Pediredla Rice University, Houston, TX
Fernando Pereira Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisboa, Portugal
Yashas Rai University of Nantes, Nantes, France
Aswin C. Sankaranarayanan Carnegie Mellon University, Pittsburgh, PA
Olgierd Stankiewicz Poznań University of Technology, Poznań, Poland
Ashok Veeraraghavan Rice University, Houston, TX
Alper Yilmaz Ohio State University, Columbus, OH, United States
About the Editors
Prof. Rama Chellappa is a distinguished university professor, a Minta Martin Professor in Engineering, and chair of the Department of Electrical and Computer Engineering at the University of Maryland, College Park, MD. He received his BE (Hons.) degree in Electronics and Communication Engineering from the University of Madras, India, and the ME (with distinction) degree from the Indian Institute of Science, Bangalore, India. He received his MSEE and PhD degrees in Electrical Engineering from Purdue University, West Lafayette, IN. At UMD, he is an affiliate professor in the Computer Science Department and the Applied Mathematics and Scientific Computing Program, a member of the Center for Automation Research, and a permanent member of the Institute for Advanced Computer Studies. His current research interests span many areas in image processing, computer vision, and machine learning.
Prof. Chellappa is a recipient of an NSF Presidential Young Investigator Award and four IBM Faculty Development Awards. He received the K.S. Fu Prize from the International Association for Pattern Recognition (IAPR). He is a recipient of the Society, Technical Achievement, and Meritorious Service Awards from the IEEE Signal Processing Society. He also received the Technical Achievement and Meritorious Service Awards from the IEEE Computer Society. Recently, he received the Inaugural Leadership Award from the IEEE Biometrics Council. At UMD, he received numerous college- and university-level recognitions for research, teaching, innovation, and mentoring of undergraduate students. In 2010, he was recognized as an Outstanding Electrical and Computer Engineer by Purdue University. He received the Distinguished Alumni Award from the Indian Institute of Science in 2016. He is a fellow of IEEE, IAPR, OSA, AAAS, ACM, and AAAI, and holds six patents.
Prof. Chellappa served as the EIC of IEEE Transactions on Pattern Analysis and Machine Intelligence, as the co-EIC of Graphical Models and Image Processing, as an associate editor of four IEEE Transactions, and as a co-guest editor of many special issues, and is currently on the editorial boards of the SIAM Journal on Imaging Sciences and Image and Vision Computing. He has also served as general and technical program chair/co-chair for several IEEE international and national conferences and workshops. He is a golden core member of the IEEE Computer Society, and has served as a distinguished lecturer of the IEEE Signal Processing Society and as the president of the IEEE Biometrics Council.
Sergios Theodoridis is currently professor of Signal Processing and Machine Learning in the Department of Informatics and Telecommunications of the University of Athens. His research interests lie in the areas of Adaptive Algorithms, Distributed and Sparsity-Aware Learning, Machine Learning and Pattern Recognition, and Signal Processing for Audio Processing and Retrieval.
He is the author of the book Machine Learning: A Bayesian and Optimization Perspective (Academic Press, 2015), the co-author of the best-selling book Pattern Recognition (Academic Press, 4th ed., 2009), the co-author of the book Introduction to Pattern Recognition: A MATLAB Approach (Academic Press, 2010), the co-editor of the book Efficient Algorithms for Signal Processing and System Identification (Prentice-Hall, 1993), and the co-author of three books in Greek, two of them for the Greek Open University.
He currently serves as editor-in-chief for the IEEE Transactions on Signal Processing. He is editor-in-chief for the Signal Processing Book Series, Academic Press and co-editor-in-chief for the E-Reference Signal Processing, Elsevier.
He is the co-author of seven papers that have received Best Paper Awards including the 2014 IEEE Signal Processing Magazine best paper award and the 2009 IEEE Computational Intelligence Society Transactions on Neural Networks Outstanding Paper Award.
He is the recipient of the 2017 EURASIP Athanasios Papoulis Award, the 2014 IEEE Signal Processing Society Education Award and the 2014 EURASIP Meritorious Service Award. He has served as a Distinguished Lecturer for the IEEE Signal Processing as well as the Circuits and Systems Societies. He was Otto Monstead Guest Professor, Technical University of Denmark, 2012, and holder of the Excellence Chair, Department of Signal Processing and Communications, University Carlos III, Madrid, Spain, 2011.
He has served as president of the European Association for Signal Processing (EURASIP), as a member of the Board of Governors for the IEEE CAS Society, as a member of the Board of Governors (Member-at-Large) of the IEEE SP Society and as a Chair of the Signal Processing Theory and Methods (SPTM) technical committee of IEEE SPS.
He is a fellow of IET, a corresponding fellow of the Royal Society of Edinburgh (RSE), a fellow of EURASIP and a fellow of IEEE.
Section Editors
Section 1
Dr Frederic Dufaux is a CNRS Research Director at Laboratoire des Signaux et Systèmes (L2S, UMR 8506), CNRS-CentraleSupelec-Université Paris-Sud, where he is head of the Telecom and Networking division. He is also editor-in-chief of Signal Processing: Image Communication.
Frédéric received his MSc in Physics and PhD in Electrical Engineering from EPFL in 1990 and 1994, respectively. He has over 20 years of experience in research, previously holding positions at EPFL, Emitall Surveillance, Genimedia, Compaq, Digital Equipment, and MIT.
Frederic is a fellow of IEEE. He was vice general chair of ICIP 2014. He is vice-chair of the IEEE SPS Multimedia Signal Processing (MMSP) Technical Committee and will serve as its chair in 2018 and 2019. He is the chair of the EURASIP Special Area Team on Visual Information Processing. He has been involved in the standardization of digital video and imaging technologies, participating in both the MPEG and JPEG Committees. He is the recipient of two ISO awards for his contributions.
His research interests include image and video coding, 3D video, high dynamic range imaging, visual quality assessment, and video transmission over wireless networks. He is author or co-author of three books (High Dynamic Range Video, Digital Holographic Data Representation and Compression, and Emerging Technologies for 3D Video), more than 120 research publications, and 17 patents issued or pending.
Section 2
Anuj Srivastava is a professor of Statistics and a distinguished research professor at Florida State University. He obtained his PhD degree in Electrical Engineering from Washington University in St. Louis in 1996 and was a visiting research associate in the Division of Applied Mathematics at Brown University during 1996–97. He joined the Department of Statistics at Florida State University in 1997 as an assistant professor, and was promoted to associate professor in 2003 and to full professor in 2007. He has held several visiting positions in Europe, including one as a Fulbright Scholar at the University of Lille, France. His areas of research include statistics on nonlinear manifolds, statistical image understanding, functional analysis, and statistical shape theory. He has published more than 200 papers in refereed journals and proceedings of refereed international conferences. He has been an associate editor for several statistics and engineering journals, including IEEE Transactions on PAMI, IP, and SP. He is a fellow of IEEE, IAPR, and ASA.
Section 3
Dr Sudeep Sarkar is a professor and chair of Computer Science and Engineering and the associate vice president for I-Corps Programs at the University of South Florida in Tampa. He received his MS and PhD degrees in Electrical Engineering, on a University Presidential Fellowship, from The Ohio State University. He is the recipient of the National Science Foundation CAREER Award in 1994, the USF Teaching Incentive Program Award for Undergraduate Teaching Excellence in 1997, the Outstanding Undergraduate Teaching Award in 1998, and the Ashford Distinguished Scholar Award in 2004. He is a fellow of the American Association for the Advancement of Science (AAAS), the Institute of Electrical and Electronics Engineers (IEEE), the International Association for Pattern Recognition (IAPR), and the American Institute for Medical and Biological Engineering (AIMBE), and a fellow and member of the Board of Directors of the National Academy of Inventors (NAI). He has served on many journal boards and is currently the editor-in-chief of Pattern Recognition Letters. He has 25 years of expertise in computer vision and pattern recognition algorithms and systems, holds four US patents, has licensed technologies, and has published high-impact journal and conference papers.
Amit Roy-Chowdhury received his PhD in Electrical and Computer Engineering from the University of Maryland, College Park (UMCP) in 2002 and joined the University of California, Riverside (UCR) in 2003, where he is a professor of Electrical and Computer Engineering and a cooperating faculty member in the Department of Computer Science and Engineering. He leads the Video Computing Group at UCR, with research interests in computer vision, image processing, pattern recognition, and statistical signal processing. His group is involved in research projects related to camera networks, human behavior modeling, face recognition, and bioimage analysis. Prof. Roy-Chowdhury's research has been supported by various US government agencies, including the National Science Foundation, and by companies such as Google, NVIDIA, Cisco, and Lockheed Martin. His research group has published close to 200 papers in peer-reviewed journals and top conferences, including approximately 50 journal papers and about 40 papers in highly competitive computer vision conferences. He is the first author of the book Camera Networks: The Acquisition and Analysis of Videos Over Wide Areas, the first monograph on the topic. His work on face recognition in art was featured widely in the news media, including a PBS/National Geographic documentary and The Economist. He is on the editorial boards of major journals and program committees of the main conferences in his area. He is a fellow of the IAPR.
Introduction
Following the success of the first edition of the Signal Processing e-reference project, which was very well received by the signal processing community, we are pleased to present the second edition. Our effort in this second phase of the project was to fill in some remaining gaps from the first edition, but mainly to take into account recent advances in the general areas of signal, image, and video processing and analytics.
The last 5 years, although a short period from a historical perspective, were very dense in terms of results and ideas in science, engineering, and technology. The availability of massive data, which we refer to as Big Data, together with advances in Machine Learning and affordable GPUs, has opened up new areas and opportunities. In particular, the application of deep learning networks to problems such as face/object detection, object recognition, and face verification/recognition has demonstrated performance not thought possible just a few years back. We are at a time when Caffe, the software for implementing deep networks, is probably used more than the FFT! We take comfort that the basic module in Caffe is convolution, the basic building block of signal and image processing.
While one cannot argue against the impressive performance of deep learning methods on a wide variety of problems, time-tested concepts such as statistical models and inference, geometry, and physics will continue to be relevant and may even enhance the performance and generalizability of deep learning networks. Likewise, the power of nonlinear devices, hierarchical computing structures, and optimization strategies that are central to deep learning methodologies will inspire new solutions to many problems in signal and image processing.
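The remark that convolution is the basic building block shared by deep networks and classical signal processing can be made concrete. A minimal valid-mode 2D convolution sketch in NumPy (illustrative only; function and variable names are not from this book):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid-mode 2D sliding-window sum: the core operation of both
    classical image filtering and convolutional network layers."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Elementwise product of the window with the kernel, summed
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Example: a 3x3 box (averaging) filter applied to a 5x5 ramp image
img = np.arange(25, dtype=float).reshape(5, 5)
box = np.ones((3, 3)) / 9.0
smoothed = conv2d_valid(img, box)  # output shape (3, 3)
```

A convolutional layer differs only in learning the kernel weights from data and stacking many such filters with nonlinearities in between.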
The new chapters that appear in these volumes offer the readers the means to keep track of some of the changes that take place in the respective areas.
We would like to thank the associate editors for their hard work in attracting top scientists to contribute chapters in very hot and timely research areas and, above all, the authors who contributed their time to write the chapters.
Section 1
Visual Information Processing
Chapter 1
Multiview video: Acquisition, processing, compression, and virtual view rendering
Olgierd Stankiewicz*; Gauthier Lafruit†; Marek Domański*
* Poznań University of Technology, Poznań, Poland
† Brussels University (French wing), Brussels, Belgium
Abstract
This chapter presents an overview of the acquisition, processing, and rendering pipeline in a multiview video system, targeting the unique feature of rendering virtual views not captured by the input camera feeds. This is called Virtual View Synthesis and supports Free Navigation similar to The Matrix bullet time effect popularized in the late 1990s. A substantial part of the chapter is devoted to explaining how to estimate the camera parameters and their relative positions in the system, as well as how to estimate or measure depth, which is essential information for obtaining smooth virtual view transitions with Depth Image-Based Rendering.
Keywords
DIBR—Depth Image-Based Rendering; View synthesis; Multiview plus depth
Acronyms
2D two-dimensional
3D three-dimensional
AVC advanced video coding
CAVE Cave Automatic Virtual Environment
DERS Depth Estimation Reference Software
DIBR depth image-based rendering
FTV Free viewpoint TV
HEVC high efficiency video coding
3D-HEVC high efficiency video coding with depth maps
HMD Head Mounted Device
IEC International Electrotechnical Commission
ISO International Organization for Standardization
ITU-T International Telecommunications Union-Telecommunication sector
JPEG Joint Photographic Experts Group
MPEG Moving Picture Experts Group
MVD multiview video plus depth
VCEG Video Coding Experts Group
VR Virtual Reality
VSRS View Synthesis Reference Software
Acknowledgments
This work was partially supported by the National Science Centre, Poland, within project OPUS according to the decision DEC-2012/05/B/ST7/01279, and within project PRELUDIUM according to the decision DEC-2012/07/N/ST6/02267, as well as project 3DLICORNEA funded by the Brussels Institute for Research and Innovation, Belgium, under Grant No. 2015-DS/R-39a/b/c/d.
The authors would also like to thank the members of the JPEG and MPEG standardization committees for fruitful discussions, as well as the experienced colleagues who participated in the project DREAMSPACE, funded by the European Commission.
1.1 Multiview Video
Ever since the success of three-dimensional (3D) games, where the user selects his/her own viewpoint in a premodeled 3D scene, much interest has arisen in mimicking this functionality on real content. In 1999, the science-fiction movie The Matrix popularized the bullet time effect, showing a continuously changing camera viewpoint around the action scene. This was obtained by cleverly triggering and interpolating hundreds of camera views, cf. Fig. 1.1, which we refer to as multiview video in the remainder of the chapter.
Fig. 1.1 Hundreds of cameras to capture The Matrix bullet time effect. © Wikipedia.
Think of the possibilities if we could create such a Free Navigation effect on the fly and at low production cost, e.g., in sports events, where each individual TV viewer chooses his/her own viewpoint on the scene.
Moreover, synthesizing two adjacent viewpoints at any moment in time supports stereoscopic viewing, and closing the loop with a stereoscopic Head Mounted Device (HMD) that continuously renders the stereo viewpoints corresponding to the actual HMD position brings Virtual Reality (VR) one step closer to a Teleported Reality, also called Cinematic VR.
Such topics are actively explored in many audiovisual standardization committees. Joint Photographic Experts Group (JPEG) and Moving Picture Experts Group (MPEG), for instance, are heavily involved in getting new multimedia functionalities off the ground, far beyond simple two-dimensional (2D) pictures and video. More specifically, 3D data representations and coding formats are intensely studied to target a wide variety of applications [1], including VR.
1.1.1 Multiview Video and 3D Graphic Representation Formats for VR
The specificity of VR over single-view video is that VR renders content from many user-requested viewpoints. This poses many challenges in modeling the data, especially when reaching for photorealistic Cinematic VR.
A striking example of this challenge is the multiformat data representation shown in Fig. 1.2 [2], capturing actors (e.g., Haka dancers) with close to a hundred multimodal cameras (RGB, IR, depth) and transforming this data successively into Point Clouds and 3D Meshes with 2D Textures. In this way, rendering from any perspective viewpoint becomes possible at relatively low cost on the client side, through the OpenGL/DirectX 3D rendering pipeline.
Fig. 1.2 From multimodal camera input (left) to Point Clouds and 3D meshes (middle) for any perspective viewpoint visualization (right). © Microsoft.
These successive transformations, however, involve heavy preprocessing steps, and if insufficient care is taken in cleaning and modeling the data, the rendering might look synthetic, with overly sharp object silhouettes and unnatural shadings, as if taken from a discolored old movie.
Other solutions that do not explicitly involve 3D geometry are a good alternative for synthesizing virtual views. Because they merely use the input RGB images and depth images, they are referred to as depth image-based rendering (DIBR), and they are the main subject of this chapter.
1.1.2 Super-Multiview Video for 3D Light Field Displays
The feeling of immersion can also be obtained with large super-multiview (SMV) or Light Field displays instead of HMDs, cf. Fig. 1.3, which project slightly varying images in hundreds of adjacent directions, thus creating a 3D Light Field with perceptually correct views. Ref. [3] interpolates a couple of dozen camera feeds to create these hundreds of views. Anybody in front of the display can then capture with his/her eyes the correct stereo pair of images, without wearing stereo glasses. In a sense, such a Light Field display is the glasses-free Cave Automatic Virtual Environment (CAVE) equivalent of Cinematic VR with an HMD.
Fig. 1.3 Holografika Super-MultiView 3D display at Brussels University: (left) clear perspective changes, and (right) an MPEG test sequence visualized. © Brussels University, VUB.
1.1.3 DIBR Smooth View Interpolation
Image-based rendering (IBR) solutions, solely using the input views and some image processing, are more appealing for photorealistic rendering, especially in live transmission applications. Unfortunately, the solution is less simple than Fig. 1.1 suggests, since it involves much more than switching between input camera views.
Very challenging are the discontinuous jumps between adjacent views, which have to be resolved with DIBR techniques, where depth plays a critical role in nonlinearly interpolating adjacent views. Figs. 1.4 [6] and 1.5 [7] (the reader is invited to use the links in the references for a video animation) show that very good results can be obtained with as few as a dozen cameras cleverly positioned around the scene.
Fig. 1.4 Fencing view interpolation. © Poznan University of Technology; M. Domański, A. Dziembowski, A. Grzelka, D. Mieloch, O. Stankiewicz, K. Wegner, Multiview Test Video Sequences for Free Navigation Exploration Obtained Using Pairs of Cameras, ISO/IEC JTC 1/SC 29/WG 11 Doc. M38247, Geneva, Switzerland, 2016, http://mpeg.chiariglione.org/news/experiencing-new-points-view-free-navigation.
Fig. 1.5 Soccer view interpolation. © Hasselt University; P. Goorts, S. Maesen, M. Dumont, S. Rogmans, P. Bekaert, Free viewpoint video for soccer using histogram-based validity maps in plane sweeping, in: Proceedings of The International Conference on Computer Vision Theory and Applications (VISAPP), 2014, pp. 378–386, https://www.youtube.com/watch?v=6MzeXeavE1s.
1.1.4 Basic Principles of DIBR
View interpolation creates a continuous visual transition between adjacent discrete camera views by synthesizing spatial views in between them with DIBR. To better apprehend the process of this depth-based view interpolation, let us explain it with a simple experiment: the view interpolation of fingers from both hands. Let us place one finger of our left hand at arm's length in front of our nose, and one finger of our right hand touching the tip of our nose, as shown in Fig. 1.6. Now, let us blink our eyes, alternating between left and right, keeping one eye open while the other is closed.
Fig. 1.6 DIBR view interpolation basic principles.
What do we observe? When our left eye is open, we see our front right-hand finger a couple of centimeters to the right of our rear left-hand finger. Conversely, when our right eye is open, we see our front right-hand finger a couple of centimeters to the left of our rear left-hand finger. Our front finger, which is very close to our eyes, clearly undergoes a large displacement when alternating between our eyes, while our rear finger hardly moves.
This phenomenon is called disparity, which by definition is the displacement of a 3D point in space from the left to the right camera view (our eyes). Obviously, the disparity of any 3D point in space is inversely proportional to its depth: our rear left-hand finger, at a large depth, has a much smaller disparity than our front right-hand finger. This phenomenon is exactly what is exploited for depth-based view interpolation.
Indeed, suppose we would like to look at the scene from a third eye, virtually positioned more or less halfway between our left and right eye. Let us say that this virtual third eye is, starting from the left eye, positioned at 3/8 of the distance between the left and right eye, cf. Fig. 1.7, which shows the left and right eye views of Fig. 1.6 combined.
Fig. 1.7 DIBR principle in synthesizing a new virtual view at fractional position α from two camera views.
If we want to synthesize the image that this virtual third eye would see, all we have to do is displace all pixels we see in our right eye to the right by 5/8 (= 1 − 3/8) of their corresponding disparity δp, where the subscript p indicates that the disparity is pixel p dependent.
In general, for the virtual third eye at fractional position α (α = 0 at the left eye, and α = 1 at the right eye), the pixels we see in our right eye should be displaced to the right by (1 − α) times their disparity δp in order to create the corresponding virtual view. For α = 0 (the virtual view corresponds to the left view), all pixels of the right view are displaced to the right by their full disparity δp, recovering the left view almost perfectly.
Notice the subtle "almost perfectly" in the previous sentence. Actually, most often, adjacent pixels at each side of an object silhouette will not be displaced by the same amount from left to right, since their depths, and hence their disparities, are likely to be very different. This results in unfilled, empty spaces around the objects' silhouettes in the virtual view, which is solved by:
• Displacing not only the pixels from the right image to the virtual position α, but also the pixels from the left image to the virtual position α
• Blending the so-obtained images, drastically reducing the empty spaces
• Filling the remaining empty cracks with inpainting techniques, cf. Section 1.5.3, which may be compared to ink that is dropped at one border of the crack and diffuses its color to the other side (note that the diffusion process propagates in both directions, in a similar way as the aforementioned left-to-virtual and right-to-virtual image transformations)
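For the rectified, purely horizontal-disparity case, the three steps above can be sketched in a few lines of Python. This is an illustrative toy working on single image rows, with last-write-wins instead of proper z-buffering and nearest-neighbor crack filling standing in for real inpainting; it is not a reference implementation.

```python
import numpy as np

def forward_warp(row, disp, shift):
    """Forward-warp one image row: pixel u moves to u + round(shift * disp[u]).
    Pixels that are never written stay at -1 and mark the holes."""
    out = np.full_like(row, -1)
    for u in range(len(row)):
        t = u + int(np.round(shift * disp[u]))
        if 0 <= t < len(row):
            out[t] = row[u]      # last write wins; real DIBR uses z-buffering
    return out

def synthesize(left, right, disp_l, disp_r, alpha):
    """Synthesize the virtual row at fractional position alpha (0 = left view,
    1 = right view) by warping both views, blending, and filling cracks."""
    from_left = forward_warp(left, disp_l, -alpha)         # left pixels shift left
    from_right = forward_warp(right, disp_r, 1.0 - alpha)  # right pixels shift right
    out = np.where(from_left >= 0, from_left, from_right)  # blend (left priority)
    for u in range(len(out)):                              # crude crack filling:
        if out[u] < 0:                                     # copy from the left
            out[u] = out[u - 1] if u > 0 else 0            # neighbor
    return out
```

Warping the left view alone leaves a disocclusion hole next to the foreground object; blending with the warped right view fills most of it, and only the remaining cracks need inpainting.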
More details about the DIBR steps are provided in Section 1.5.
1.1.5 DIBR vs. Point Clouds
Is DIBR on multiview video sequences the best approach to obtain nice Free Navigation effects? Not necessarily. As already shown in Fig. 1.2, many different data formats can serve the purpose of Free Navigation. In particular, the Point Cloud data representation of Fig. 1.8 [8] provides all the means to freely navigate the scene. At large distances, cf. Fig. 1.8(A), the effect is very convincing. However, at shorter viewing distances, cf. Fig. 1.8(C)-(E), blocky splatting effects with increased point sizes start to occur.
Fig. 1.8 London Science Museum Point Cloud (A) decomposed into individual clusters (B) taken from the position indicated by the empty circles. Splat rendering in (C) to (E). ScanLab projects Point Cloud data sets, http://grebjpeg.epfl.ch/jpeg_pc/index_galleries.html.
Fig. 1.9 gives a glimpse of the preprocessing and rendering quality differences between DIBR (left; use the link in the reference to see the animation [9]) and Point Clouds (right). To obtain this high-density Point Cloud, many different acquisition positions of the Time-of-Flight Lidar scanner had to be covered, followed by intensive data preparation and preprocessing steps [10]. DIBR techniques, on the contrary, restrict themselves to depth extraction and image warping operations to obtain an acceptable rendering without too many noticeable artifacts.
Fig. 1.9 DIBR perspective view synthesis (left) vs. Point Cloud rendering (right).
Nevertheless, Fig. 1.10 shows that DIBR and Point Cloud representations are mathematically very similar. In essence, in a Point Cloud representation, each 3D point in space captured by a Time-of-Flight active depth sensing device is given by its position (x,y,z) and the light colors it emits in all directions (referred to as a BRDF).
Fig. 1.10 Point Cloud data format (left) and its equivalence with DIBR (right).
Its complementary image-based representation captures these light rays into RGB images, e.g., pixel (u,v) captures the light ray under angles (θ, φ) emanating from the point (x,y,z). Capturing multiple images from different viewpoints makes it possible to estimate the depth of each point (e.g., by stereo matching), resulting in a fully corresponding DIBR representation of the Point Cloud.
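This pixel-to-point correspondence can be made concrete: inverting the pinhole projection turns a depth map plus camera intrinsics into a Point Cloud. The following is a minimal sketch assuming a simple intrinsic model with focal length f and principal point (cu, cv); the function names and numeric values in the usage below are illustrative.

```python
import numpy as np

def backproject(u, v, z, f, cu, cv):
    """Invert the pinhole projection u = f*x/z + cu, v = f*y/z + cv:
    back-project pixel (u, v) with z-value z to the 3D point (x, y, z)."""
    return np.array([(u - cu) * z / f, (v - cv) * z / f, z])

def depth_map_to_point_cloud(zmap, f, cu, cv):
    """Turn a dense depth map into an N x 3 Point Cloud (one point per pixel)."""
    vs, us = np.indices(zmap.shape)          # pixel coordinate grids
    z = zmap.ravel().astype(float)
    x = (us.ravel() - cu) * z / f
    y = (vs.ravel() - cv) * z / f
    return np.stack([x, y, z], axis=1)
```

In this sense a multiview-plus-depth set is simply a Point Cloud stored per camera, which is the mathematical equivalence illustrated in Fig. 1.10.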
The similarity between DIBR and Point Clouds will become more apparent in Section 1.5. More details about this discussion can also be found in the JPEG/MPEG JAhG (Joint Ad-Hoc Group) Light/Sound Fields technical output document [11], as well as in Ref. [1].
Though the underlying data formats of DIBR and Point Clouds are known to be mathematically equivalent, the compression and rendering qualities may differ and depend heavily on the preprocessing level of the acquired data. DIBR is the preferred technique proposed in the MPEG-FTV group for Free Navigation and SMV applications. We have therefore selected DIBR to be presented in more detail in the remainder of the chapter.
1.1.6 DIBR, Multiview Video, and MPEG Standardization
MPEG, whose mission is to develop audiovisual compression standards for the industry, is currently actively involved in all aforementioned techniques: DIBR, Point Cloud, and 3D mesh coding. The interested reader is referred to Ref. [1] for an in-depth overview of use cases and 3D-related data representation and coding formats.
In particular, in MPEG terminology, the multicamera DIBR technology to which this chapter is devoted is referred to as multiview video. Multiview test sequences are made available to the community, helping to develop improved depth estimation, view synthesis, and compression techniques, cf., respectively, Sections 1.4–1.6. Some of the test sequences are presented in Fig. 1.11 and have been acquired with the acquisition systems presented in Section 1.2.4.
Fig. 1.11 MPEG test sequences used in the DIBR exploration activities.
1.2 Multiview Video Acquisition
1.2.1 Multiview Fundamentals
The term multiview is widely used to describe various systems. In this section we deal exclusively with systems where the outputs from all cameras are processed jointly in order to produce some general representation of the scene. Therefore, for example, video surveillance systems where video is registered and viewed from each camera independently are out of the scope of our considerations.
The goal of the multiview video acquisition considered here is to produce a 3D representation of a scene. Such a 3D representation is needed to enable functionalities related to VR free navigation, e.g., to synthesize virtual views as seen from arbitrary positions in the scene. Currently, the most commonly used format for such a representation is multiview video plus depth (MVD). In MVD, the many 2D (flat) images from the cameras are augmented with depth maps, which bring additional information about the third dimension of the scene. Depth maps are matrices of values reflecting the distances between the camera and points in the scene. Typically, depth maps are presented as gray-scale video, cf. Fig. 1.12, where closer objects are marked in high intensity (light) and far objects in low intensity (dark).
Fig. 1.12 The original view (A) and corresponding depth map (B) of a single frame of the Poznan Blocks [104, 105] 3D video test sequence. In the depth map, closer objects are marked in high intensity (light) and far objects in low intensity (dark). © Poznań University of Technology.
The particular meaning of the term "depth" varies in the literature, and the term is often used interchangeably with others such as disparity, normalized disparity, or z-value. In fact, in Fig. 1.12, the "depth" represented by the gray-scale value of the pixels corresponds to disparity: closer pixels have a higher value, in line with a higher disparity value. In order to understand the exact meanings of the terms "depth" and "disparity," let us consider a pinhole model of a camera.
A pinhole camera is a simplified camera obscura in which the lenses have been replaced with a single pinhole acting as an infinitely small aperture. Light rays from the 3D scene pass through this single point and project an inverted image on the opposite side of the box, cf. Fig. 1.13A. In some formulations of the pinhole camera model, the fact that the projected image is inverted is omitted by assuming that a (virtual) image plane is placed on the same side of the pinhole as the scene, cf. Fig. 1.13B. Therefore the aperture equation is:
Fig. 1.13 A visualization of a pinhole camera (A), its simplified model (B), and the corresponding coordinate system (C).
u/f = x/z,  v/f = y/z  (1.1)
where f is the focal length (distance of the image plane along the z axis), u and v are coordinates on the projected image plane of some point with x, y, z coordinates in 3D space. In the mentioned equation the projected coordinate system UV is shifted so that it is centered on the principal point P, cf. Fig. 1.13C. In such a case the perspective projection is as follows:
u = f · x/z,  v = f · y/z  (1.2)
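The perspective projection just described amounts to two divisions. A minimal numeric sketch (with the principal point at the origin of the shifted UV system, as assumed above):

```python
def project(x, y, z, f):
    """Perspective projection of a 3D point onto the image plane at focal
    distance f, with the principal point at the origin of the UV system."""
    return f * x / z, f * y / z
```

Note that a point twice as far from the camera projects twice as close to the principal point; this inverse dependence on z is the geometric root of disparity.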
Using such a pinhole camera model, we can consider an object F, cf. Fig. 1.14, observed from two cameras (left and right). In each view, the object is seen from a different angle and position, cf. Fig. 1.14, and therefore its observed positions on the image planes differ (FL in the left image and FR in the right image). With only a single view (e.g., the left one), we cannot tell the distance of the object observed at position FL. Instead of the real position F, the object could equally be at position G, I, or J; these candidates F, G, I, and J would be observed in the right view at positions FR, GR, IR, and JR, respectively. The set of all potential 3D positions of the object observed at FL, projected onto the other image plane, is geometrically constrained to lie along a so-called epipolar line.
Fig. 1.14 Epipolar line as a projection to the right view of all potential position of object residing on ray ρ , seen in the left view as a single point.
An epipolar line is the projection of a ray, pointing from the optical center of one camera to a 3D point, onto the image plane of another view. For example, in Fig. 1.14, the ray ρ (starting in OL and passing through F) is seen by the left camera as a point, because it is directly in line with that camera's projection. The right camera, however, sees this ray ρ in its image plane as a line. Such a projected line in the view of the right camera is called an epipolar line. The epipolar line marks the potential corresponding points in the right view for pixel FL in the left view.
The positions and orientations of epipolar lines are determined by the locations of the cameras, their orientations, and other parameters such as focal length, angle of view, etc. Typically, all of these parameters are gathered in the form of intrinsic and extrinsic camera parameter matrices, cf. Section 1.3.1. In the general case, epipolar lines may lie along arbitrary angles, cf. Fig. 1.16 (top).
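As a concrete illustration of the intrinsic/extrinsic parameterization mentioned above (detailed in Section 1.3.1), the full projection of a world point can be sketched as follows; all numeric values here are made up for the example.

```python
import numpy as np

# Illustrative intrinsic matrix K and extrinsic parameters [R | t];
# the focal length and principal point values are made up.
f, cu, cv = 500.0, 320.0, 240.0
K = np.array([[f, 0.0, cu],
              [0.0, f, cv],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                  # camera aligned with the world axes
t = np.zeros(3)                # camera at the world origin

def project_point(K, R, t, X):
    """Project a world point X through the pinhole model: x ~ K (R X + t)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]        # divide out the homogeneous coordinate
```

With two such camera models, the epipolar line of a pixel in one view is obtained by projecting several points along its back-projected ray into the other view.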
A common and important case for considering depth is the linear arrangement of cameras with parallel optical axes, cf. Fig. 1.15. Such a setup can be obtained either by precise physical positioning of the cameras or by a postprocessing step called rectification, cf. Fig. 1.16 (bottom).
Fig. 1.15 Linear arrangement of the cameras with parallel axes of the viewpoints.
Fig. 1.16 Original (top) and rectified (bottom) images from the Poznan Blocks [104, 105] multiview video sequence for a selected pair of views. An exemplary epipolar line for ray ρ has been marked with a solid line. © Poznań University of Technology.
In such a linearly arranged case, the image planes of all views coincide, cf. Fig. 1.16 (bottom), and the epipolar lines all align with the horizontal axes of the images. Therefore the differences in the observed positions of objects become disparities along horizontal rows. Due to the projective nature of such a video system with linear camera arrangement, the further an object is from the camera, the closer its projection is to the center of the image plane.
1.2.2 Depth in Stereo and Multiview Video
Let us now define the various terms related to depth, using the camera geometry of Fig. 1.17.
Fig. 1.17 Exemplary objects E, F, and G—projected onto image planes of two cameras.
For a given camera (e.g., the left one in Fig. 1.14), the distance between the optical center (OL) and the object (e.g., represented by a single pixel) along the z axis is called the z-value. The unit of the z-value is the same as the distance unit in the scene. This may be, e.g., meters or inches, and depends on the scale of the represented scene.
The distance between observed positions of object F, with FL in the left image and FR in the right image, is called disparity:
ΔF = FL − FR  (1.3)
In the general case, the disparity is a vector lying along the respective epipolar line. In the case of linear, rectified camera arrangements, disparity vectors lie along horizontal lines.
The z-value and the disparity between two views are mathematically related, and one can be derived from the other if the positions and models of the cameras are known. For a stereoscopic pair of cameras, horizontally aligned (or rectified) and separated by the baseline distance B, the z-value ZF of point F can be calculated as follows:
ZF = f · B / ΔF  (1.4)
where ZF is the distance along the z-axis of point F from the camera's optical center, f is the focal length, B is the baseline distance between the pair of cameras, and ΔF is the disparity of point F.
Obviously, a similar but more complex relation between the z-value and the disparity can be found for other arrangements of camera pairs.
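For the rectified case, the z-value/disparity conversion is a one-liner; the sketch below assumes the standard rectified-stereo form Z = f · B / Δ, with the focal length expressed in pixels and the baseline in scene units.

```python
def z_from_disparity(disparity_px, focal_px, baseline):
    """Rectified-stereo depth: Z = f * B / disparity.
    Focal length in pixels, baseline in scene units -> Z in scene units."""
    return focal_px * baseline / disparity_px
```

Note the inverse relation discussed earlier: doubling the disparity halves the z-value, so nearby objects exhibit the largest disparities.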
The term "depth," depending on the context, may have different meanings: it is often used interchangeably with disparity and with the z-value distance, or sometimes as a generalized term describing both. Usually, in the literature, the term "depth" simply refers to the data stored in depth maps, e.g., in the form of images or files on disk. For example, in OpenEXR Blender files [12] the depth is stored in the form of z-values represented as 32-bit floating-point numbers (IEEE-754 [13]).
The most common format of depth value representation is an 8-bit integer representing the so-called normalized disparity. The disparity is normalized so that the represented range (0–255) covers the range of disparities from Δmin to Δmax, corresponding to z-values from zfar to znear, respectively. Such a format combines the advantages of representing depth as a disparity and as a z-value. The representation of depth as disparity better suits the human visual system (the foreground is represented with finer quantization than the background), disparity can be used almost directly for view synthesis, and it is the direct product of depth estimation algorithms. On the other hand, raw disparity depends on the considered camera arrangement (e.g., disparity is proportional to the baseline B).
The representation of depth as a z-value does not have this disadvantage. The same goes for normalized disparity, because the disparity values are normalized with respect to the zfar to znear range.
The depth defined as normalized disparity can be calculated from the following equations:
d = round(dmax · (Δ − Δmin)/(Δmax − Δmin))  (1.5)

d = round(dmax · (1/z − 1/zfar)/(1/znear − 1/zfar))  (1.6)
where dmax is the maximal value of the given integer binary representation, e.g., 255 for 8-bit integers or 65,535 for 16-bit integers.
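The z-value-to-normalized-disparity quantization described above can be sketched as follows; the inverse-z form and the round-to-nearest convention are assumptions consistent with the text, not a normative specification.

```python
import numpy as np

def z_to_normalized_disparity(z, z_near, z_far, bits=8):
    """Quantize a z-value to an integer normalized disparity so that
    z_near maps to dmax (light) and z_far maps to 0 (dark)."""
    dmax = (1 << bits) - 1
    d = dmax * (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return int(np.round(d))
```

Because the mapping is linear in 1/z, nearby depths receive many more quantization levels than distant ones, which is exactly the foreground-favoring behavior mentioned above.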
The selected depth range (i.e., Δmin to Δmax, or zfar to znear), together with the bit width of the integers used, affects the precision with which the depth is represented. For a given bit width, e.g., 8 bits, the broader the selected depth range, the worse the precision. On the other hand, enforcing sufficient precision narrows the depth range, which at some point may preclude the representation of some objects in the scene. For sophisticated scenes with nonlinear camera arrangements, it is often impossible to meet both requirements at the same time, and the use of a wider integer range (e.g., 16 bits) becomes mandatory.
Apart from the depth precision, which is an attribute of the representation format, another important factor is the depth accuracy, which results from the method of depth acquisition. Depth can be acquired with specialized equipment, such as depth sensors, or by means of algorithmic estimation, as explained in Section 1.4.
In the case of depth sensors, the achievable depth accuracy depends on the underlying technology. For example, for the Microsoft Kinect [14], cf. Fig. 1.18 (left), the depth error ranges from 2 mm for close objects to 7 cm for far objects [14]. For the Mesa Imaging SR4000 [15] depth sensor, cf. Fig. 1.18 (right), the accuracy is about 15 mm. It must be noted, however, that the accuracy of depth sensors also depends on factors like the reflectivity of objects in the scene, the distance to the scene, etc.
Fig. 1.18 Photography of Microsoft Kinect depth sensor (left) and Mesa Imaging SR4000 depth sensor (right).
In the case of depth estimation algorithms, like the ones in Section 1.4, the accuracy is strongly influenced by the resolution of the analyzed images. The depth is estimated by finding image correspondences between the views, and thus the direct outcome of such estimation is disparity. The measurement of distances in the images is limited by their resolution, and thus the higher the resolution, the more accurate the depth [16]. Also it is crucial that the image