T3 Parte 1 SSMM

Grupo de Tratamiento de Imágenes Universidad Politécnica de Madrid
Multimedia Systems and Services
T3: Compression of Audiovisual Signals
Introduction and fundamental concepts
Luis Salgado
L.Salgado@gti.ssr.upm.es
SSMM @ ETSIT-UPM Compression of Audiovisual Signals – Fundamentals – 1

Contents
1. Introduction
2. Digital representation
- Speech/Audio
- Images
- Video
3. Compression: needs, goals and requirements
4. Source coding

Introduction
- Sources of information:
  - Audio/speech
  - Image
  - Video
  - 2D/3D graphics

Introduction
- Sampling → discrete-time signal
  - Sampling frequency: fs
  - Nyquist sampling theorem: fs ≥ 2 × fmax
  - Sets the minimum sampling rate for perfect reconstruction
- Quantization → digital signal
  - Finite number of possible values per sample
- Source coding:
  - Quantization intervals represented by symbols from a finite alphabet

Digital representation Speech/Audio
- Components between 20 Hz and 20,000 Hz
  - Limit of human hearing: ~20,000 Hz
- Uncompressed speech
  - Narrowband telephony: Pulse Code Modulation (ITU-T G.711)
    - Non-uniform robust quantization: A-law (log PCM)
    - 256 quantization intervals → 8 bits/sample
  - Wideband and extensions (ITU-T G.711.1)

Freq. range (Hz)   fs (kHz)   Bits per sample   Bitrate (kb/s)   Application
300 – 3,400        8          8                 64               Narrowband telephony
< 8,000            16         14                224              Wideband speech

Digital representation Speech/Audio
- Uncompressed audio:
  - Different qualities, depending on the application
  - Uniform quantization

fs (kHz)   Bits per sample   Channels   Bitrate (kb/s)      Application
32         16                2          1,024               DAT, broadcast (still…)
44.1       16                2          1,411.2             CD, DAT (consumer)
48         16                2          1,536               DAT, DVD-Video (professional)
96         16/20/24          2          4,608 (at 24 bits)  Digital recording software/hardware
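
The bitrates in the speech and audio tables follow from bitrate = fs × bits per sample × channels. The sketch below simply redoes that arithmetic; the function name `pcm_bitrate_kbps` is a hypothetical helper, not something taken from the slides.

```python
def pcm_bitrate_kbps(fs_khz: float, bits_per_sample: int, channels: int = 1) -> float:
    """Raw (uncompressed) PCM bitrate in kb/s."""
    return fs_khz * bits_per_sample * channels


# Rows taken from the tables: (fs in kHz, bits per sample, channels)
formats = {
    "Narrowband telephony": (8, 8, 1),
    "Wideband speech": (16, 14, 1),
    "CD": (44.1, 16, 2),
    "DVD-Video": (48, 16, 2),
    "96 kHz / 24 bit": (96, 24, 2),
}

for name, (fs_khz, bits, channels) in formats.items():
    print(f"{name}: {pcm_bitrate_kbps(fs_khz, bits, channels):.1f} kb/s")
```

The 4,608 kb/s entry for 96 kHz is reproduced only with 24 bits per sample; 16 or 20 bits give lower rates.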

Aliasing
- Sampling → replication of the signal spectrum
- High frequencies overlap → aliasing
[Example: real music at 48 kHz; at 8 kHz with aliasing; at 8 kHz after low-pass filtering (LPF). © Mark Handley]
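
As a small numerical illustration of the spectrum overlap, the sketch below shows that a 5 kHz tone sampled at 8 kHz produces exactly the same samples as a 3 kHz tone. The tone frequencies and the helper names are assumptions chosen for the example.

```python
import math


def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Apparent frequency, in [0, fs/2], of a real tone at f_hz sampled at fs_hz."""
    f_mod = f_hz % fs_hz
    return min(f_mod, fs_hz - f_mod)


def sample_cosine(f_hz: float, fs_hz: float, num_samples: int) -> list:
    """Samples of cos(2*pi*f*t) taken at t = n / fs."""
    return [math.cos(2 * math.pi * f_hz * n / fs_hz) for n in range(num_samples)]


fs = 8000
tone = sample_cosine(5000, fs, 16)    # above fs/2, violates Nyquist
alias = sample_cosine(3000, fs, 16)   # where the tone folds back to

print(alias_frequency(5000, fs))      # folded frequency in Hz
print(max(abs(a - b) for a, b in zip(tone, alias)))  # practically zero
```

Low-pass filtering before sampling removes the 5 kHz component instead of letting it reappear at 3 kHz.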

Digital representation of Images
- 2D signals
  - Sampled at specific spatial locations
  - Quantized signal values
- Pixel: minimum unit
- Aspect ratio: width/height
Source: “Digital Image Processing”, R. Gonzalez and R. Woods, Prentice Hall, 2002

- Spatial resolution: “dots per inch” (dpi)
  - Higher spatial resolution → higher definition
  - Signal frequency is defined in space
- Depth: “bits per pixel” (bpp)
  - Bits used to represent the information of each pixel
  - Indicates the degree of compression
[Figure: the same image at 72 dpi (Web) and at 300 dpi. Source: hp.com]

- Intuitive Nyquist
[Figure: original image, sampling points and sampled image when the spatial resolution is too coarse: details are missed]

- Intuitive Nyquist
[Figure: original image with a signal period of 2 mm, sampled with a spatial resolution (sampling interval) of 1 mm: no details are missed]
Nyquist:
  spatial resolution ≤ ½ × minimum signal period, or
  sampling frequency fs ≥ 2 × maximum frequency in the image
Digital representation of images
- RGB:
  - Uniform quantization of each color plane: 8 bits
  - True color: 24 bpp
  - Others: YCrCb, HSV…
- Indexed color:
  - Limited set of colors
  - Pixel values are indices into a Color Look-Up Table (CLUT)
Source: “An Introduction to Digital Image Processing with MATLAB”, A. McAndrew, Victoria University of Technology, 2004.

Example (RGB): 512 × 512 × 3 × 8 bits ≈ 6.29 Mbit
Luminance and color
- Human vision ranges: scotopic, mesopic, photopic.
- Retinal photoreceptors
  - Rods: large number (~100 million), scotopic, no color, low illumination (night).
  - Cones: smaller number (~6 million), photopic, color, high illumination (day).
- Tri-receptor theory of color vision (1802)
  - The photoreceptors (cones) render three values α_i:
    $\alpha_i(C) = \int C(\lambda)\, a_i(\lambda)\, d\lambda, \quad i = 1, 2, 3$
[Figure: curves a_1(λ), a_2(λ), a_3(λ). Source: “Digital Image Processing”, R. Gonzalez and R. Woods, Prentice Hall, 2002]

Luminance and color
$Y = \int C(\lambda)\, a_Y(\lambda)\, d\lambda, \qquad a_Y(\lambda) = a_1(\lambda) + a_2(\lambda) + a_3(\lambda)$
[Figure: curves a_Y(λ), a_1(λ) (drawn ×20), a_2(λ) and a_3(λ). Source: “Video Processing and Communications”, Y. Wang, Prentice Hall, 2002]

Luminance and Chrominances: YCrCb

Color sensitivity

Subsampling color (512 × 512 image, 8 bits per sample)
- 4:2:2: ~4.19 Mbit
- 4:2:0: ~3.14 Mbit
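
The sizes quoted for the 512 × 512 example can be checked by counting the average number of samples per pixel in each chroma format. This is only a sketch of that arithmetic; `raw_image_bits` and its table of samples per pixel are hypothetical names introduced here.

```python
# Average samples per pixel: 1 luma sample plus the (subsampled) Cb and Cr samples.
SAMPLES_PER_PIXEL = {"4:4:4": 3.0, "4:2:2": 2.0, "4:1:1": 1.5, "4:2:0": 1.5}


def raw_image_bits(width: int, height: int, chroma: str = "4:4:4",
                   bits_per_sample: int = 8) -> float:
    """Uncompressed image size in bits for a given chroma sampling format."""
    return width * height * SAMPLES_PER_PIXEL[chroma] * bits_per_sample


for chroma in ("4:4:4", "4:2:2", "4:2:0"):
    print(chroma, raw_image_bits(512, 512, chroma) / 1e6, "Mbit")
```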

Aliasing
- Staircasing
- Interference
Sources: “epigrammedia.com”, “directorfotografia.com”, “www.svi.nl”

Aliasing
- Moiré patterns
Source: “en.wikipedia.org”

Digital representation Images
- Uncompressed images:
  - Resolution depends on the sensor
  - Uniform quantization of the color planes

Width (pixels)   Height (pixels)   Planes   Bits per plane   Size (Mbit)   Device
2688             1520              3        8                98.058        HTC One (M8), 4 Mpix
3264             2448              3        8                191.77        Apple iPhone 5s, 8 Mpix
4320             2432              3        8                252.15        Motorola X, 10 Mpix
4160             3120              3        8                311.50        LG G2, 13 Mpix
4992             3744              3        8                448.56        Nokia Lumia 1520, 20 Mpix

HVS – Frequency Response

Video signal
- Analog:
  - Represented as a continuous (time-varying) 1-D signal
  - Sampling is done in space (rows) and time → scanning
- Digital:
  - Sequence of digital images: a set of quantized, coded samples
  - Keeps quality through regeneration
  - Greatly expands the processing and manipulation capabilities
- Scanning:
  - Periodic sampling of the light information at the camera
  - Records information about the light distribution along predefined sampling lines
  - Includes control information: synchronization pulses

Scanning
- Progressive scanning
  - Generates a complete image (frame) by sampling consecutive lines
  - Frame rate: images/s
  - To avoid flicker: typically 25/30 images/s (PAL/NTSC)
- Interlaced scanning
  - Each scan takes 1 of every N lines of the complete image
  - Interlace ratio N:1, typically N = 2
  - The whole image (frame) is composed of 2 fields (half images)

Analog video main parameters

Main parameters                                   PAL              NTSC
Number of lines in the image [lines/frame]        625              525
Line scanning interval or line period             64 μs            63.555 μs
Line blanking or horizontal blanking interval     12.05 μs         10.90 μs
Interlace ratio                                   2:1              2:1
Field blanking or vertical blanking interval      25 lines/field   21 lines/field
Field period                                      20 ms            16.6833 ms
Field rate                                        50 Hz            59.94 Hz

- Bandwidth
  - Computed for the worst case: B = 7.38 MHz estimated for PAL TV
  - Perceptual and display resolution limitations allow B to be reduced
  - Bedford and Kell estimate a 30% reduction → B' = 5.5 MHz

Digital video signal
[Block diagram]
- TV/video camera → RGB signals
- Analog path: analog formatting → composite video (PAL/NTSC/SECAM) → analog TV & analog broadcasting
- Digital path: A/D conversion → digital video, BT.601 / 709 / 2020 (TV / HDTV / UHDTV studio) → digital formatting and compression (MPEG-X / HEVC) → lower bit rates → DTV & digital broadcasting

Video digitization
- Raster signal sampling → line sampling (horizontal dimension)
- Sampling rate:
  - Samples vertically aligned
  - Horizontal sampling interval = vertical sampling interval
  - Common to the different systems
  f_l (NTSC) = 525 lines/frame × 29.97 frames/s ≈ 15.734 kHz (nominally 525 × 30 = 15.75 kHz)
  f_l (PAL) = 625 lines/frame × 25 frames/s = 15.625 kHz
  f_s = 858 × f_l (NTSC) = 864 × f_l (PAL) = 13.5 MHz
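
The common 13.5 MHz sampling rate can be verified from the line frequencies. The sketch below assumes the exact NTSC frame rate of 30000/1001 ≈ 29.97 frames/s; the function and variable names are hypothetical.

```python
def line_frequency_hz(lines_per_frame: int, frames_per_second: float) -> float:
    """Line frequency f_l = lines per frame x frames per second."""
    return lines_per_frame * frames_per_second


NTSC_FRAME_RATE = 30000 / 1001  # assumption: exact value behind "29.97"
PAL_FRAME_RATE = 25

f_l_ntsc = line_frequency_hz(525, NTSC_FRAME_RATE)
f_l_pal = line_frequency_hz(625, PAL_FRAME_RATE)

fs_ntsc = 858 * f_l_ntsc  # 858 samples per line in the 525-line system
fs_pal = 864 * f_l_pal    # 864 samples per line in the 625-line system

print(f"f_l NTSC = {f_l_ntsc:.2f} Hz, f_l PAL = {f_l_pal} Hz")
print(f"fs NTSC = {fs_ntsc / 1e6:.4f} MHz, fs PAL = {fs_pal / 1e6:.4f} MHz")
```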

Digital video digitization/representation

- ITU-R BT.601: Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios

                                              NTSC 525/60     PAL 625/50
  Field rate                                  60 fields/s     50 fields/s
  Total samples per line                      858 pels        864 pels
  Active samples per line                     720 pels        720 pels
  Samples outside the active area (per line)  122 + 16 pels   132 + 12 pels
  Total lines                                 525             625
  Active lines                                480             576

- ITU-R BT.709: Parameter values for the HDTV standards for production and international programme exchange
- ITU-R BT.2020: Parameter values for ultra-high definition television systems for production and international programme exchange
Luminance and Chrominances

- Non-gamma-corrected components in [0, 1]: R → E_R, G → E_G, B → E_B
  * These may already be digital signals in [0, 255], normalized
- Luminance in [0, 1]:
  $E_Y = k_R E_R + (1 - k_R - k_B) E_G + k_B E_B$
  k_R, k_B: contributions of R and B to the luminance
- Normalized color differences in [-0.5, 0.5]:
  $E_{C_R} = \frac{E_R - E_Y}{2 (1 - k_R)}, \qquad E_{C_B} = \frac{E_B - E_Y}{2 (1 - k_B)}$

Standard   k_R      k_B      Luminance
BT.601     0.299    0.114    E_Y = 0.299 E_R + 0.587 E_G + 0.114 E_B
BT.709     0.2126   0.0722   E_Y = 0.2126 E_R + 0.7152 E_G + 0.0722 E_B
BT.2020    0.2627   0.0593   E_Y = 0.2627 E_R + 0.6780 E_G + 0.0593 E_B

Quantization
- BT.601 (D = 1 or 4 → 8 or 10 bits):
  $Y = \mathrm{int}[(219 E_Y + 16) \cdot D] / D$
  $C_R = \mathrm{int}[(224 E_{C_R} + 128) \cdot D] / D$
  $C_B = \mathrm{int}[(224 E_{C_B} + 128) \cdot D] / D$
- BT.709 (n = 8 or 10 bits) and BT.2020 (n = 10 or 12 bits):
  $Y = \mathrm{int}[(219 E_Y + 16) \cdot 2^{n-8}]$
  $C_R = \mathrm{int}[(224 E_{C_R} + 128) \cdot 2^{n-8}]$
  $C_B = \mathrm{int}[(224 E_{C_B} + 128) \cdot 2^{n-8}]$
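
The conversion chain (luminance, normalized color differences, quantization) can be sketched as follows, using the k_R and k_B values listed for each standard and the 2^(n-8) scaling form. Two assumptions are made here: `int` is taken to mean rounding to the nearest integer, and the function name `rgb_to_ycbcr` is purely illustrative.

```python
import math

# (k_R, k_B) per standard, as listed in the slides.
COEFFS = {
    "BT.601": (0.299, 0.114),
    "BT.709": (0.2126, 0.0722),
    "BT.2020": (0.2627, 0.0593),
}


def rgb_to_ycbcr(e_r, e_g, e_b, standard="BT.601", n_bits=8):
    """Map normalized R, G, B in [0, 1] to integer (Y, Cb, Cr) code values."""
    k_r, k_b = COEFFS[standard]
    e_y = k_r * e_r + (1 - k_r - k_b) * e_g + k_b * e_b
    e_cr = (e_r - e_y) / (2 * (1 - k_r))
    e_cb = (e_b - e_y) / (2 * (1 - k_b))
    scale = 2 ** (n_bits - 8)

    def quantize(value):
        # Assumption: "int" means round to nearest (half rounds up).
        return math.floor(value * scale + 0.5)

    return (quantize(219 * e_y + 16),
            quantize(224 * e_cb + 128),
            quantize(224 * e_cr + 128))


print(rgb_to_ycbcr(1.0, 1.0, 1.0))  # white
print(rgb_to_ycbcr(0.0, 0.0, 0.0))  # black
print(rgb_to_ycbcr(1.0, 0.0, 0.0))  # pure red
```

With 8 bits, luminance occupies the code values 16 to 235 and the chrominances are centered on 128, leaving headroom and footroom in the 0 to 255 range.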

Chrominance sampling
- Structure
  - Spatially static and orthogonal
  - Samples of the components should be co-sited
  - Repeated for each line, field and frame
- Sampling hierarchy
  - Sampling families are represented by three values that identify the sampling frequency of each component (… initially!!)
  - The maximum sampling frequency is identified with 4
  - fs = 13.5 MHz in BT.601 (18 MHz was also considered)

Chrominance sampling

Format   For every      Chroma samples   Subsampling
4:4:4    2x2 Y pixels   4 Cb & 4 Cr      None
4:2:2    2x2 Y pixels   2 Cb & 2 Cr      2:1, horizontally only
4:1:1    4x1 Y pixels   1 Cb & 1 Cr      4:1, horizontally only
4:2:0    2x2 Y pixels   1 Cb & 1 Cr      2:1, both horizontally and vertically
[Figure legend: Y pixel; Cb and Cr pixel]

- Adequate filtering is always required before subsampling
- 4:4:4: RGB or YCrCb
- 4:2:0: different implementations differ in chroma location
  - How should it be implemented in interlaced video?
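
A minimal sketch of 4:2:0 subsampling of one chroma plane is shown below. It assumes that averaging each 2x2 block is an acceptable (very crude) pre-filter and that the resulting chroma sample sits at the centre of the block; real systems use better filters and, as noted above, differ in chroma location. The function name is hypothetical.

```python
def subsample_420(plane):
    """Reduce a chroma plane (list of equal-length rows) by 2:1 in both
    directions, averaging each 2x2 block as a simple low-pass filter."""
    height, width = len(plane), len(plane[0])
    if height % 2 or width % 2:
        raise ValueError("plane dimensions must be even")
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


cb = [
    [100, 102, 50, 50],
    [104, 106, 50, 50],
]
print(subsample_420(cb))  # one chroma sample per 2x2 block
```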
Some data rates…

Video format   Size          Color sampling      Frame rate (Hz)   Raw data rate (Mbps)
UHD production and programme exchange
BT.2020 – 8K   7680x4320     4:4:4/4:2:2/4:2:0   25P/60P           12441/29859
BT.2020 – 4K   3840x2160     4:4:4/4:2:2/4:2:0   25P/60P           3110/7464
HDTV production and programme exchange
BT.709         1920x1080     4:2:2               24P/30P/60I       796/995/995
HDTV over air, cable, satellite (MPEG-2 video, 20-45 Mbps)
SMPTE295M      1920x1080     4:2:0               24P/30P/60I       597/746/746
SMPTE296M      1280x720      4:2:0               24P/30P/60P       265/332/664
Video production
BT.601         720x480/576   4:4:4               60I/50I           249
BT.601         720x480/576   4:2:2               60I/50I           166
High quality video distribution (DVD, SDTV)
BT.601         720x480/576   4:2:0               60I/50I           124
Intermediate quality video distribution (VCD, WWW)
SIF            352x240/288   4:2:0               30P/25P           30
Video conferencing over ISDN/Internet
CIF            352x288       4:2:0               30P               37
Video telephony over wired/wireless modem
QCIF           176x144       4:2:0               30P               9.1
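
The raw data rates in the table are width × height × frame rate × bits per pixel. The sketch below recomputes a few rows; the helper name is hypothetical. One inference, not stated in the slides: the BT.2020 figures are reproduced with 10-bit samples and 4:2:0 sampling (15 bits per pixel), while the other rows match 8-bit samples.

```python
SAMPLES_PER_PIXEL = {"4:4:4": 3.0, "4:2:2": 2.0, "4:2:0": 1.5}


def raw_rate_mbps(width, height, frames_per_second, chroma, bits_per_sample=8):
    """Uncompressed video data rate in Mbps (10^6 bits per second)."""
    bits_per_pixel = SAMPLES_PER_PIXEL[chroma] * bits_per_sample
    return width * height * frames_per_second * bits_per_pixel / 1e6


print(raw_rate_mbps(720, 480, 30, "4:2:2"))          # BT.601, video production
print(raw_rate_mbps(1920, 1080, 24, "4:2:2"))        # BT.709, 24P
print(raw_rate_mbps(176, 144, 30, "4:2:0"))          # QCIF
print(raw_rate_mbps(7680, 4320, 25, "4:2:0", 10))    # BT.2020 8K, assumed 10-bit
```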

Need for compression
- Image: 8.0 million pixel camera (iPhone), 3264x2448
  - ≈ 24 MByte/image → 41 pictures / 1 GB
- Video:
  - Video 720x576, RGB, 25 frames/s → 31.1 MByte/s
  - Audio 16 bits x 44.1 kHz, stereo → 176.4 KByte/s
  - DVD disc, 4.7 GB → ~2.5 min per DVD disc
  - RGE-1 network (DTT multiplex): 19.91 Mbps → not even 1 STV channel!!!
- Sending video from a cell phone:
  - 352x288, RGB, 15 frames/s → 4.56 MByte/s
- Bandwidth → cost
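
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch assumes 3 bytes per RGB pixel and decimal units (1 GB = 10^9 bytes); all variable names are illustrative.

```python
BYTES_PER_RGB_PIXEL = 3  # 8 bits for each of R, G and B

image_bytes = 3264 * 2448 * BYTES_PER_RGB_PIXEL
pictures_per_gb = int(1e9 // image_bytes)

video_bytes_per_s = 720 * 576 * BYTES_PER_RGB_PIXEL * 25
audio_bytes_per_s = 44100 * 2 * 16 // 8           # stereo, 16 bits per sample
dvd_minutes = 4.7e9 / (video_bytes_per_s + audio_bytes_per_s) / 60

phone_bytes_per_s = 352 * 288 * BYTES_PER_RGB_PIXEL * 15

print(f"image: {image_bytes / 1e6:.1f} MByte, {pictures_per_gb} pictures per GB")
print(f"video: {video_bytes_per_s / 1e6:.1f} MByte/s")
print(f"audio: {audio_bytes_per_s / 1e3:.1f} KByte/s")
print(f"DVD:   {dvd_minutes:.1f} min")
print(f"phone: {phone_bytes_per_s / 1e6:.2f} MByte/s")
```

The uncompressed SD video alone needs about 249 Mbps, more than twelve times the 19.91 Mbps capacity of the multiplex.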

Goals of compression
- ↓↓ Redundancy: “…exceeds what is necessary or required…”
  - Symbol redundancy: take advantage of symbol probabilities
  - Spatial and temporal redundancy
    - Adjacent samples are highly correlated
    - In video, co-located samples in different images are correlated
- ↓↓ Irrelevance or perceptual redundancy
  - Depends on perceptual limitations: build a perceptual model
  - Remove the information that will not be perceived
- Represent the same information with fewer bits…
  - Lossless: preserves ALL the information, which is perfectly recoverable
  - Lossy: eliminates perceptually insignificant information → the original signal cannot be recovered

Perceptual redundancy
- Exploited through the quantization of the audiovisual information
  - Quantization intervals adapted to the sensitivity of the auditory/visual system
  - Smaller intervals for information to which the system is more sensitive
  - Larger intervals for information that is less perceptually relevant
- The signal should be “represented” in an alternative way (space) in which:
  - The relevance of the information is highlighted from the perceptual point of view
  - If possible, samples are less correlated than in the original space → introduces signal decorrelation
- Perceptual models are defined in the frequency domain:
  - Signal transformations are to be used: subband decomposition, FFT, DCT…

Requirements for compression algorithms

- Lossless
  - The decoded signal is mathematically equivalent to the original one
  - Drawback: achieves only a small or modest level of compression
- Lossy
  - The decoded signal is of lower quality than the original one
  - Advantage: achieves a very high degree of compression
- Objective:
  - Maximize the degree of compression for a given quality, or
  - Maximize the quality for a given degree of compression
- General compression requirements
  - Ensure good quality of the decoded signal at high compression ratios
  - Minimize the complexity of the encoding and decoding processes
  - Support multiple channels and various data rates
  - Introduce a small processing delay

Fidelity Criteria for quality evaluation

- Used to measure the signal quality
  - In compression, to evaluate the impact of losing real or quantitative signal information
- Subjective criteria:
  - Require the definition of a grading scale for qualitative evaluation
  - Require standardized testing protocols involving a relevant sample of people → difficult to implement
  - The best approach for comparison when the target is to generate high-quality compressed signals according to human perception
- Objective criteria:
  - Evaluate the similarity between two signals (for example, images), one of them taken as the reference, through a mathematical function
  - In compression, they indicate the loss of information between the input and output signals of the compression process
  - Not always correlated with subjective evaluation results!!

- Mean Square Error (MSE)
  - Example: for an image I(x,y) of size M×N and its decoded version Î(x,y):
    $\sigma^2_{mse} = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} [I(x,y) - \hat{I}(x,y)]^2$
- SNR and PSNR (dB):
    $SNR = 10 \log_{10} \frac{\sigma^2}{\sigma^2_{mse}}, \qquad \sigma^2 = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} [I(x,y)]^2$
    $PSNR = 10 \log_{10} \frac{[\max_{theoretical}(I)]^2}{\sigma^2_{mse}}$

- Example for image compression
[Figure: images a), b) and c)]
  - Objective: SNR-ab = 11.35, SNR-ac = 11.69
  - Subjective: b: passable; c: marginal?
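
The objective measures above translate directly into code. The following is a minimal sketch for images stored as lists of rows; the 2x2 test images are made up for illustration, and `max_value=255` assumes 8-bit samples.

```python
import math


def mse(reference, decoded):
    """Mean square error between two images of identical size."""
    rows, cols = len(reference), len(reference[0])
    total = sum((reference[x][y] - decoded[x][y]) ** 2
                for x in range(rows) for y in range(cols))
    return total / (rows * cols)


def snr_db(reference, decoded):
    """SNR in dB: mean signal power of the reference over the MSE."""
    rows, cols = len(reference), len(reference[0])
    power = sum(v * v for row in reference for v in row) / (rows * cols)
    return 10 * math.log10(power / mse(reference, decoded))


def psnr_db(reference, decoded, max_value=255):
    """PSNR in dB: squared theoretical maximum over the MSE."""
    return 10 * math.log10(max_value ** 2 / mse(reference, decoded))


original = [[100, 100], [100, 100]]     # hypothetical image
compressed = [[102, 98], [100, 100]]    # hypothetical decoded image

print(mse(original, compressed))
print(round(snr_db(original, compressed), 2))
print(round(psnr_db(original, compressed), 2))
```

Both functions are undefined when the two images are identical (MSE = 0), which a production implementation would need to handle.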

Source coding scheme

x[n] → Transformation → z[n] → Quantization → zq[n] → Coding → bit patterns
(x[n]: block of n samples; z[n]: transform coefficients; zq[n]: quantized coefficients)

- Transformation:
  - Alternative representation of the signal
  - Helps to remove redundancy and highlight perceptual relevance
  - Reversible: typically no loss of information
- Quantization:
  - Adapted to signal redundancy and perceptual relevance
  - Loss of information! → the original signal is not recoverable
- Coding (VLC: Variable Length Coding):
  - Input data (symbols) are transformed into codewords
  - Removes input data redundancy: symbol probabilities → entropy coding
VLC Coding
- Ignores the semantics of the input data and compresses media streams by regarding them as sequences of digits or symbols
- Examples: run-length coding, Huffman coding, …
- Desired properties of symbol codes:
  - Non-singular: every symbol xi in X maps to a different codeword
  - Uniquely decodable: every sequence {x1, x2, …, xn} maps to a different codeword sequence
  - Instantaneous: no codeword is a prefix of any other codeword
[Figure: nested sets: instantaneous codes ⊂ uniquely decodable codes ⊂ non-singular codes]

VLC Coding
Examples
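
As one possible example, the sketch below builds a Huffman code for a small alphabet and checks that it is instantaneous (prefix-free). The symbol probabilities are invented for illustration, and the helper names are hypothetical; this is not necessarily the example shown on the original slide.

```python
import heapq


def huffman_code(probabilities):
    """Build a Huffman code {symbol: bit string} from {symbol: probability}.
    Assumes at least two symbols."""
    # Heap entries: (probability, tie-breaker, partial code table).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # two least probable subtrees
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]


def is_prefix_free(code):
    """True if no codeword is a prefix of another one (instantaneous code)."""
    words = list(code.values())
    return not any(i != j and words[j].startswith(words[i])
                   for i in range(len(words)) for j in range(len(words)))


probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # hypothetical source
code = huffman_code(probs)
average_length = sum(probs[s] * len(code[s]) for s in probs)

print(code)
print(is_prefix_free(code), average_length)
```

For this source the average codeword length equals the entropy (1.75 bits/symbol), compared with 2 bits/symbol for a fixed-length code.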

Transformation: Prediction
- Signals are correlated → use previous samples to predict the current one
  - Encoding the prediction error (the difference between the signal value and its prediction) lowers the bitrate
  - Prediction gain: for the same number of bits per sample, the use of prediction yields a gain in the signal-to-distortion ratio
[Block diagram]
- Coder: e[n] = x[n] − x̂[n] → Quantizer → eq[n]; the predictor (with delays) is fed with eq[n] + x̂[n]
- Decoder: x[n] = eq[n] + x̂[n], using the same predictor (with delays)
- Predictor: $\hat{x}[n] = \sum_i \alpha_i\, x[n-i]$
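
The coder/decoder loop can be sketched as a first-order DPCM. The assumptions are: a single predictor coefficient (`alpha`), a uniform quantizer with a fixed step, and a predictor that starts from zero; none of these choices come from the slides.

```python
def dpcm_encode(samples, step, alpha=1.0):
    """Quantize the prediction error; return the quantizer indices."""
    prediction = 0.0
    indices = []
    for x in samples:
        error = x - prediction
        index = round(error / step)              # uniform quantizer
        indices.append(index)
        reconstructed = prediction + index * step
        prediction = alpha * reconstructed       # predict from decoded value
    return indices


def dpcm_decode(indices, step, alpha=1.0):
    """Rebuild the signal with the same predictor as the encoder."""
    prediction = 0.0
    output = []
    for index in indices:
        reconstructed = prediction + index * step
        output.append(reconstructed)
        prediction = alpha * reconstructed
    return output


signal = [10, 12, 13, 13, 15, 18]    # hypothetical correlated samples
step = 2
indices = dpcm_encode(signal, step)
decoded = dpcm_decode(indices, step)

print(indices)
print(decoded)
```

Because the encoder predicts from the reconstructed samples rather than from the originals, the quantization error does not accumulate: each decoded sample stays within half a quantization step of the input.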

Transformation: frequency domain
- Target:
  - Decompose the original signal into “sub-signals”, each corresponding to a different frequency band
  - Generate a new signal in which the original signal energy is distributed among a reduced set of samples (coefficients) → energy efficiently packed
- Reversible process → no information is lost
- Compression: not achieved directly
- What for?
  - To apply perceptual models for coding:
    - Separating irrelevance…
    - Reducing redundancy…

Subband decomposition
- Filter bank that isolates the parts of the signal that correspond to different frequency ranges
  - Analysis filters + decimation → generation of the subband signals
  - Upsampling + synthesis filters → signal reconstruction
- Perfect reconstruction is possible: it depends on the filters used
- No compression is achieved
Source: http://zone.ni.com
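
A two-band filter bank with perfect reconstruction can be sketched with the Haar filters, chosen here only because they are the simplest case (2-tap low-pass and high-pass filters followed by decimation by 2). The function names and the test signal are assumptions.

```python
import math

SCALE = 1 / math.sqrt(2)


def haar_analysis(x):
    """Split an even-length signal into low and high subbands (decimated by 2)."""
    low = [(x[i] + x[i + 1]) * SCALE for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * SCALE for i in range(0, len(x), 2)]
    return low, high


def haar_synthesis(low, high):
    """Upsample and combine the two subbands to rebuild the signal."""
    x = []
    for l, h in zip(low, high):
        x.extend([(l + h) * SCALE, (l - h) * SCALE])
    return x


signal = [4, 6, 10, 12, 8, 6, 5, 5]   # hypothetical samples
low, high = haar_analysis(signal)
rebuilt = haar_synthesis(low, high)

print(low)
print(high)
print(rebuilt)
```

The two subbands together contain as many samples as the input, which is why the decomposition by itself achieves no compression.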

Discrete Cosine Transform (DCT)
- Linear transformation used for audio and image/video compression
  - Fast algorithms exist to compute it, based on the DFT (FFT)
  - Real valued (an integer DCT is used)
  - Preserves energy; its energy packing (signal decorrelation) is close to that of the optimum transform
- The signal is represented in an alternative space as:
  - A sequence of numbers (1-D DCT)
  - A matrix of numbers (2-D DCT)
  - Each number represents the amount of a certain frequency pattern present in the original signal

1-D DCT
- The DCT of a sequence of N samples of a signal x[n] is a sequence of N coefficients C[u]
  - “n” refers to the temporal axis
  - “u” refers to the frequency axis
- The transformation consists of representing x[n] as a linear combination of N cosine-shaped base functions

Base functions:
$F_b(u, n) = \sqrt{\frac{2}{N}}\, K(u) \cos\frac{(2n+1)u\pi}{2N}$

$C[u] = \sqrt{\frac{2}{N}}\, K(u) \sum_{n=0}^{N-1} x[n] \cos\frac{(2n+1)u\pi}{2N} = \sum_{n=0}^{N-1} x[n]\, F_b(u, n), \quad u = 0, 1, \dots, N-1$

with $K(u) = 1/\sqrt{2}$ if $u = 0$, and $K(u) = 1$ if $u \neq 0$

Base functions for N=8

[Figure: the 8 base functions, u = 0 … 7, each plotted against n]
- 8 base functions (u = 0...7), each of 8 samples (n = 0...7)
- They represent signals of increasing frequency content:
  - u = 0 represents the DC component
  - u is the frequency axis; there are N functions of n (time axis)

DCT example
[Figure: x[n] → DCT → C[u]. The signal x[n] is shown as the sum of the base functions for u = 0, 1, 2, …, weighted by the coefficients C[0], C[1], C[2], …]
Most of the energy is concentrated in the first coefficients!!!
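
Energy compaction can be observed numerically with a direct (slow) implementation of the 1-D DCT. The sketch assumes the orthonormal scaling sqrt(2/N) with K(0) = 1/sqrt(2), which is the form that preserves energy; the 8-sample block is a made-up smooth signal.

```python
import math


def dct(x):
    """Orthonormal 1-D DCT of a list of N samples (direct O(N^2) evaluation)."""
    n_samples = len(x)
    coefficients = []
    for u in range(n_samples):
        k = 1 / math.sqrt(2) if u == 0 else 1.0
        total = sum(x[n] * math.cos((2 * n + 1) * u * math.pi / (2 * n_samples))
                    for n in range(n_samples))
        coefficients.append(math.sqrt(2 / n_samples) * k * total)
    return coefficients


def energy(values):
    return sum(v * v for v in values)


block = [10, 11, 12, 13, 13, 12, 11, 10]   # hypothetical smooth block, N = 8
coeffs = dct(block)
fraction = energy(coeffs[:3]) / energy(coeffs)

print([round(c, 3) for c in coeffs])
print(f"energy in the first 3 coefficients: {100 * fraction:.2f} %")
```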

I-DCT example
- Considering only the first three coefficients
  - Those that keep most of the energy
  - Quantizing the other 5 coefficients to 0
- Reconstruct the signal using the inverse DCT (I-DCT)
[Figure: quantized DCT coefficients Cr[u] (only C[0], C[1] and C[2] are non-zero) → I-DCT → reconstructed signal xr[n], a somewhat “smoothed” version of the original signal x[n]; the reconstruction error is plotted against n]
The reconstruction error is typically low for most common signals!!!!
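
The experiment described above can be sketched as follows: compute the DCT, set all but the first three coefficients to zero, and apply the inverse DCT. The orthonormal scaling (sqrt(2/N), K(0) = 1/sqrt(2)) and the sample values are assumptions made for the illustration.

```python
import math


def _k(u):
    return 1 / math.sqrt(2) if u == 0 else 1.0


def dct(x):
    """Orthonormal 1-D DCT (direct evaluation)."""
    size = len(x)
    return [math.sqrt(2 / size) * _k(u)
            * sum(x[n] * math.cos((2 * n + 1) * u * math.pi / (2 * size))
                  for n in range(size))
            for u in range(size)]


def idct(c):
    """Inverse of dct(): rebuilds x[n] from the coefficients C[u]."""
    size = len(c)
    return [math.sqrt(2 / size)
            * sum(_k(u) * c[u] * math.cos((2 * n + 1) * u * math.pi / (2 * size))
                  for u in range(size))
            for n in range(size)]


x = [10, 12, 15, 19, 22, 24, 25, 25]      # hypothetical original block
c = dct(x)
c_kept = c[:3] + [0.0] * (len(c) - 3)     # keep C[0..2], quantize the rest to 0
x_rec = idct(c_kept)

error_energy = sum((a - b) ** 2 for a, b in zip(x, x_rec))
print([round(v, 2) for v in x_rec])
print(f"relative error energy: {100 * error_energy / sum(v * v for v in x):.3f} %")
```

Since the transform preserves energy, the energy of the reconstruction error equals the energy of the discarded coefficients.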

Credits
- Some of the contents have been adapted from material originally created by Enrique Rendón Angulo (EUITT-UPM), Juan Carlos San Miguel, Jesús Bescós and José María Martínez Sánchez (EPS-UAM).
