
Related Works:

[1] https://arxiv.org/pdf/1501.00092.pdf
[2] https://arxiv.org/pdf/1506.07552v2.pdf
[3] http://www.image-net.org/
[4] http://ivpl.ece.northwestern.edu/sites/default/files/07444187.pdf
[5] https://people.csail.mit.edu/celiu/pdfs/VideoSR.pdf

Overview:
We use example-based super-resolution (SR) techniques to enhance the
resolution of a video. Specifically, we focus on parallelizing the
computation so that the processing time for each frame is at most
~1/60 of a second (i.e., 60 frames per second). Current state-of-the-art
video SR methods have seen varying degrees of success, with processing
times ranging from 0.24 seconds per frame [4] to over a minute [5].
Reducing this time below the frame interval of a video is crucial for
applications that involve live-streaming video.
CNN: In this paper, we implement video SR via convolutional neural
networks (CNNs), a machine learning approach that has proven quite
effective for image enhancement [1]. We treat the CNN as a black box in
that the network parameters are already known. The processing time will be
a function of the complexity of the network (i.e., more parameters mean
longer processing). The CNN operates by learning a mapping from input data
to desired outputs. In our case, the CNN seeks to determine F such that:

F(X) = Y

where X is the low-resolution version of the image and Y is the
high-resolution image.
General CNN Architecture: Let the input low-resolution (LR) image have size
M × N with 3 channels. The first layer of the CNN yields:

F_1(X) = F(W_1 * X + B_1)

where W_1 and B_1 are the weights and biases and * denotes the convolution
operation. W_1 has dimensions f_1 × f_1. The function F is some activation
function (for example, sigmoid, max, softmax, etc.). The second layer of
the CNN follows the same form:

F_2(X) = F(W_2 * F_1 + B_2)
More layers can be added to the network. The output of the n-th layer is
given by:

F_n(X) = F(W_n * F_{n-1} + B_n), where n > 1

The final layer of the network is:

F(X) = W_k * F_{k-1} + B_k
where k is the number of layers in the CNN. Thus, the final layer produces
an image that approximates the ground-truth version of the image. The
parameters we must find are:

{W_1, ..., W_k, B_1, ..., B_k}
These are found by minimizing the function:

E = (1/n) Σ_{i=1..M} |F(X_i) - Y_i|^2 + (λ_1/2) Σ_{i=1..k} W_i^2 + (λ_2/2) Σ_{i=1..k} B_i^2

where the first sum runs over the M training pairs and the remaining two
terms regularize the weights and biases.
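
To make the layer equations concrete, the following is a minimal
NumPy/SciPy sketch of the forward pass and the objective E. The ReLU
activation and the regularization weights lam1 and lam2 are illustrative
assumptions, not values taken from this work:

    import numpy as np
    from scipy.signal import convolve2d

    def activation(z):
        # One possible choice for F; sigmoid, softmax, etc. are alternatives.
        return np.maximum(z, 0.0)

    def forward(X, weights, biases):
        # Compute F(X) for a single-channel frame X. weights[i] is the
        # f_i x f_i filter W_{i+1}; biases[i] is the scalar B_{i+1}.
        F = X
        for W, B in zip(weights[:-1], biases[:-1]):
            F = activation(convolve2d(F, W, mode='valid') + B)
        # The final layer is linear: F(X) = W_k * F_{k-1} + B_k.
        return convolve2d(F, weights[-1], mode='valid') + biases[-1]

    def objective(X_train, Y_train, weights, biases, lam1=1e-4, lam2=1e-4):
        # The regularized loss E over the training pairs (lam1/lam2 assumed).
        err = 0.0
        for X, Y in zip(X_train, Y_train):
            out = forward(X, weights, biases)
            # 'valid' convolutions shrink the output, so crop Y to match.
            dy = (Y.shape[0] - out.shape[0]) // 2
            dx = (Y.shape[1] - out.shape[1]) // 2
            err += np.sum((out - Y[dy:dy + out.shape[0], dx:dx + out.shape[1]]) ** 2)
        reg = (lam1 / 2) * sum(np.sum(W ** 2) for W in weights) \
            + (lam2 / 2) * sum(B ** 2 for B in biases)
        return err / len(X_train) + reg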
Processing a video: Using the weights and biases determined above, we can
process a low-resolution image or video file. We aim for both accuracy and
efficiency and thus process the video in several ways:
(1) Processing the whole video in parallel. Individual frames are processed
separately on separate nodes.
[Figure: the LR video is split into frames; each frame is passed through
the CNN on its own node, and the outputs are combined into the HR video.]
Each frame is processed on a separate node. We use an Amazon EC2 cluster
with Spark to implement the process. Processing time per frame is
calculated as:

Processing Time per Frame = Total Processing Time / Number of Frames
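
As a sketch of how method (1) could be driven and timed with Spark, the
snippet below parallelizes one frame per partition; the stub
enhance_frame and the stand-in frame list are hypothetical placeholders
for the real CNN and decoder:

    import time
    import numpy as np
    from pyspark import SparkContext

    def enhance_frame(frame):
        # Placeholder for the CNN forward pass (see the sketch above);
        # in practice this would apply the pre-trained weights and biases.
        return frame

    frames = [np.zeros((480, 640)) for _ in range(60)]  # stand-in LR frames

    sc = SparkContext(appName="video-sr")
    start = time.time()
    hr_frames = (sc.parallelize(frames, numSlices=len(frames))  # one frame per partition
                   .map(enhance_frame)                          # runs on the worker nodes
                   .collect())                                  # gather HR frames in order
    total_time = time.time() - start
    print("Processing time per frame:", total_time / len(frames))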
The accuracy is determined by the PSNR metric:

MSE = (1/(MN)) |f(X) - Y|^2
PSNR = 10 log_10(255^2 / MSE)

where Y is the ground-truth version of each frame and f(X) is the output
frame from the above process.
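
The metric itself is a few lines of NumPy; the peak value of 255 assumes
8-bit pixels, as in the formula above:

    import numpy as np

    def psnr(output, ground_truth):
        # PSNR between the CNN output f(X) and the ground-truth frame Y,
        # assuming 8-bit pixel values (peak = 255).
        diff = output.astype(np.float64) - ground_truth.astype(np.float64)
        mse = np.mean(diff ** 2)  # (1/(MN)) * sum of squared errors
        return 10.0 * np.log10(255.0 ** 2 / mse)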


(2) Processing one frame at a time. Each frame is split into smaller
sub-frames, and each sub-frame is sent to a separate node for processing.
The processed sub-frames are concatenated together to produce the final
result.

(3) We experiment with overlapping sub-frames from a single frame. This
method is the same as (2), except that the sub-frames overlap to
compensate for the size reduction an image undergoes when passing through
the CNN. A sketch of this tiling follows below.
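
The sketch below shows one way to implement the tiling in methods (2) and
(3); the tile size and overlap are assumed parameters, with the overlap
for method (3) set to the border the CNN consumes:

    import numpy as np

    def split_frame(frame, tile=128, overlap=0):
        # Split a frame into (row, col, sub-frame) triples. overlap=0
        # corresponds to method (2); setting overlap to the border the
        # CNN consumes corresponds to method (3).
        step = tile - overlap
        return [(r, c, frame[r:r + tile, c:c + tile])
                for r in range(0, frame.shape[0], step)
                for c in range(0, frame.shape[1], step)]

    def reassemble(tiles, shape):
        # Concatenate processed sub-frames back into the full frame
        # (method (2); for method (3) each placement would additionally
        # be offset by the border the CNN removes from each tile).
        out = np.zeros(shape)
        for r, c, t in tiles:
            out[r:r + t.shape[0], c:c + t.shape[1]] = t
        return out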
Obtaining LR Video: Let the ground-truth frame have dimensions M × N with
3 channels. The LR video is obtained by first scaling the video down by a
factor of 1/n, where n > 1. Then, bicubic interpolation is performed to
scale the video back up to dimensions M × N. Next, Gaussian blur is
applied to the video. Finally, artificial noise is introduced. The process
can be summarized by the equation:
Y = W * X + ε

where ε is the random noise and W is the filter corresponding to Gaussian
blur multiplied by the filter for bicubic interpolation. X denotes the
down-sampled video and Y is the LR output.

[Figure: Video → Scale Down → Bicubic Interpolation → Gaussian Blur →
Random Noise → LR Video.]
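
As a concrete illustration, a per-frame version of this degradation
pipeline might look as follows with OpenCV; the scale factor n, blur
kernel, and noise level are assumed parameters, not tuned values:

    import cv2
    import numpy as np

    def degrade_frame(frame, n=2, ksize=(5, 5), sigma=1.0, noise_std=5.0):
        # Downscale by 1/n, bicubic-upscale back to M x N, blur, add noise.
        M, N = frame.shape[:2]  # M rows, N columns
        small = cv2.resize(frame, (N // n, M // n), interpolation=cv2.INTER_CUBIC)
        up = cv2.resize(small, (N, M), interpolation=cv2.INTER_CUBIC)
        blurred = cv2.GaussianBlur(up, ksize, sigma)
        noisy = blurred.astype(np.float64) + np.random.normal(0.0, noise_std, blurred.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)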

Zero-padding the inputs: Because the CNN we are using [1] produces an
output image smaller than the input, we must zero-pad each frame before
enhancing its resolution. The size of the zero-padded border will depend
on the parameters of the neural network. For method (2) we must zero-pad
each sub-frame. For method (3) we can either zero-pad each sub-frame or
pad only along the outer edge of the entire frame. Zero-padding adds extra
computation time that must be accounted for.
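
A small sketch of the padding computation, assuming odd f_i × f_i filters
and 'valid' convolutions (so each layer removes f_i - 1 pixels per
dimension):

    import numpy as np

    def zero_pad(frame, filter_sizes):
        # Pad a border of half the total shrinkage on every side so the
        # CNN output matches the input size (single-channel frame assumed).
        border = sum(f - 1 for f in filter_sizes) // 2
        return np.pad(frame, ((border, border), (border, border)),
                      mode='constant')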
Plan:
1. Set up Amazon EC2 node clusters. Convert a video to LR.
2. Use Spark to enhance the video using the 3 methods listed above.
Record accuracy and processing time for each.
3. Repeat for other videos. Compare results to previous work.
