

Multimedia analysis techniques for e-learning

Dimitris N. Kanellopoulos
Educational Software Development Laboratory,
Department of Mathematics,
University of Patras,
University Campus, GR 26500, Rio, Patras, Greece
E-mail: d_kan2006@yahoo.gr

Abstract: Multimedia analysis techniques can enable e-learning systems and
applications to understand multimedia content automatically. Therefore, such
techniques can provide various novel services to both e-learning video
providers and learners. This paper aims at providing lecturers and learners with
a state-of-the-art overview of multimedia analysis techniques used mainly in
internet video for e-learning purposes. Finally, the paper presents some of the
open issues for further research in the area of video analysis for e-learning.

Keywords: multimedia; video analysis; learning with videos; internet video.

Reference to this paper should be made as follows: Kanellopoulos, D.N. (2012)
‘Multimedia analysis techniques for e-learning’, Int. J. Learning
Technology, Vol. 7, No. 2, pp.172–191.

Biographical notes: Dimitris N. Kanellopoulos holds a PhD in Multimedia
Communication from the Department of Electrical and Computer Engineering
of the University of Patras, Greece. He is a member of the Educational
Software Development Laboratory in the Department of Mathematics at the
University of Patras. His research interests include multimedia communication,
intelligent information systems, knowledge representation, and web
engineering. He has authored many papers in international journals and
conferences in these areas. He serves as a member of the editorial boards for
several academic journals.

1 Introduction

In the last decade, e-learning has become a promising alternative to traditional
classroom learning. In addition, e-learning is helping society move toward a vision of
lifelong and on-demand learning (Zhang et al., 2004). Recent advances in multimedia and
communication technologies have resulted in powerful learning systems with
instructional video components. Video is a rich and powerful medium, as it can present
information in an attractive and consistent manner (Corporation for Public Broadcasting,
2004). The emergence of non-linear, interactive digital video technology allows learners
to interact with instructional video, while they experience videos in their leisure time
(Zhang et al., 2006). A major ‘media attribute’ of interactive video is random access to
video content. This means users or learners can select or play a segment with minimal
search time. Interactive video allows proactive and random access to video content. This
enhances learner’s engagement, and thus it improves learning effectiveness.




Prior studies have investigated the effect of instructional video on learning outcomes
(Sorensen and Baylen, 1999). Merkt et al. (2011) compared learning with videos to
learning with print, examining the role of interactive features. For their study, they used
two videos affording different degrees of interactivity
and a content-equivalent illustrated textbook. They found that features enabling
micro-level activities, such as stopping the video or browsing, seemed to be more
beneficial for learning than features enabling macro-level activities, such as referring to a
table of contents or an index. Based on another empirical study, Zhang et al. (2006)
examined the influence of interactive video on learning outcome and learner satisfaction
in e-learning environments. Four different settings were studied: three were e-learning
environments – with interactive video, with non-interactive video, and without video. The
fourth was the traditional classroom environment. Results of their experiment showed
that the value of video for learning effectiveness was contingent upon the provision of
interactivity. Students in the e-learning environment that provided interactive video
achieved significantly better learning performance and a higher level of learner
satisfaction than those in the other settings. In contrast, students in the e-learning
environment that provided non-interactive video showed no such improvement. Their findings
suggest that it may be important to integrate interactive instructional video into e-learning
systems.
Generally, interactive video in e-learning applications can support more learners
by integrating options adapted to different types of users with different knowledge
levels. For example, short quizzes can be integrated into the content and associated with
different scenes depending on the result obtained by the user (Meixner et al., 2010).
However, such an application goal requires proper multimedia analysis
techniques, which offer various means to understand multimedia content
automatically (Lian, 2011). In this direction, Repp and Meinel (2006) proposed a
semantic indexing scheme for recorded educational lecture videos. To get access to the
lecture’s content, the audio layer with the recorded voice of the lecturer is analysed.
Based on speech recognition, their approach generates a time stamp for words and
clusters of words, such that search engines can find the exact position of particular
information inside a video recording. From another perspective, Kim and Yoon (2010)
have focused on lectures represented as synchronised multimedia data streams, and have
introduced an e-learning model that supports the harmonised media. In their framework,
recorded video lectures are annotated with the content of the slides presented by the
lecturer. Therefore, the content of a given slide is synchronised with the corresponding
learning video sequence. This is achieved by using a recommendation engine based
on semantic analysis. Finally, Teese (2007) describes the ‘LivePhoto Physics’
project that is producing homework and in-class assignments for teaching introductory
physics using video analysis techniques. The assignments span a typical year-long
course, including mechanics as well as electricity, heat engines and physical optics.
To the best of our knowledge, the effect of interactive video on e-learning is still not
well understood. This effect depends strongly on the video analysis techniques used in
e-learning systems. Multimedia analysis offers various means to understand
multimedia content automatically. A multimedia analysis task involves processing of
multimodal data in order to obtain valuable insights about the data, a situation, or a
higher-level activity. Examples of multimedia analysis tasks include content-based image
search, semantic concept detection, audio-visual speaker detection, human tracking, event
detection, automatic news programme segmentation, etc. For these tasks, the multimedia
data used could be sensory (such as audio, video, or RFID) as well as non-sensory (such
as WWW resources and databases).
Video content analysis (VCA) can automatically analyse learning video sequences to
detect and characterise temporal events that cannot be inferred from a single image. The
algorithms can be
implemented as software on general-purpose machines, or as hardware in specialised
video processing units. Based on the internal representation that VCA generates in the
machine, it is possible to build novel functionalities for e-learning applications such as
video browsing based on story segmentation and video recommendation based on content
similarity. It is noteworthy that multimodal fusion is the integration of multiple media,
their associated features, or the intermediate decisions in order to perform a multimedia
analysis task. Atrey et al. (2010) surveyed the state-of-the-art research related to
multimodal fusion and commented on these works from the perspective of the usage of
different modalities, the levels of fusion, and the methods of fusion. There are many
multimedia analysis tasks that have been successfully performed using a variety of fusion
methods in a wide range of domains. However, in the e-learning domain only a few
efforts (Pauli et al., 2007; Repp and Meinel, 2006; Teese, 2007; Kim and Yoon, 2010)
have been recorded, such as automatic extraction of lesson highlights.
The aim of this paper is to present research efforts on multimedia analysis tasks for
e-learning applications and to stimulate further research. The paper is mainly focused on
investigating key issues related to e-learning video services based on multimedia
analysis techniques. The remainder of the paper is structured as follows. Section 2
discusses multimedia analysis techniques, while Section 3 presents e-learning video
services based on media analysis. Section 4 presents learner experiences based on
media analysis. Section 5 discusses various multimedia-based assignments for learning
with video annotation. Finally, Section 6 concludes the paper and gives directions for
future work.

2 Multimedia analysis techniques

Multimedia analysis techniques1 can be used in a wide range of domains including
e-learning, entertainment, health care, retail, automotive, transport, home automation,
safety and security. For example, in safety and security, we see various applications such
as sensitive content detection/filtering based on content understanding and mining,
intelligent surveillance based on event detection, piracy tracking based on content
fingerprinting, and person identification based on biometric recognition. Multimedia
analysis techniques can be categorised into four categories: text analysis, audio analysis,
image analysis and video analysis.

2.1 Text and audio analysis


Text analysis techniques enable machines to understand text. Such techniques
often focus on word spotting, which partitions a sentence into words. Another
interesting topic is semantic text analytics, which interprets the spotted words and
includes term detection and categorisation, named entity extraction and categorisation,
thematic segmentation, etc. Also worth mentioning are the text retrieval techniques used
in all internet search engines. In particular, text retrieval can be classified into two main
classes:

1 Word-based indexing and document retrieval, which addresses the precise syntactic
properties of a text, comparable to substring matching in string searches. The text
can be generally unstructured and not necessarily in a natural language. An example
of word-based indexing is a suffix tree algorithm; a minimal sketch of this class
follows the list.
2 Content-based indexing, which relies on semantic connections between
documents and between queries and documents (Dang and Owczarzak, 2008).
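As an illustration of the word-based class, the following Python sketch (not from the paper; the documents and tokeniser are hypothetical) builds a simple inverted index and answers a multi-word query by intersecting posting lists:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Naive word spotting: lowercase the text and keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    # Word-based indexing: map each word to the ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    # Retrieve documents containing every query term (posting-list intersection).
    postings = [index.get(w, set()) for w in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "Shot boundary detection in video",
        2: "Semantic indexing of lecture videos",
        3: "Video summarisation and event detection"}
print(search(build_inverted_index(docs), "video detection"))  # -> {1, 3}
```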
Audio analysis techniques extract information and meaning from audio signals
for analysis, classification, storage, retrieval, synthesis, etc. (Chen and Jang, 2009). These
techniques include speech to text transcription, speaker indexing, anchorperson detection,
audio fingerprint, musical analysis, and so on. Audio analysis techniques are based on
current advances in speech recognition and machine listening in order to automatically
locate, manipulate, skim, browse, and index audio streams. Such techniques often simply
discriminate speech from non-vocal music or other sounds. Note that there is no
guarantee that a given multimedia audio source contains speech, and it is important not to
waste valuable resources attempting to perform speech recognition on music, silence, or
other non-speech audio.
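As a minimal sketch of that speech/non-speech discrimination step, the following Python fragment uses two classic short-time features, log energy and zero-crossing rate; the sampling rate, frame length and decision threshold are illustrative assumptions, not values from the literature cited above:

```python
import numpy as np

def frame_features(signal, rate=16000, frame_ms=25):
    # Split the signal into short frames and compute, per frame,
    # the log energy and the zero-crossing rate (ZCR).
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[:n_frames * frame_len]).reshape(n_frames, frame_len)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy, zcr

def looks_like_speech(signal, rate=16000, threshold=0.5):
    # Heuristic: speech alternates voiced frames (high energy, low ZCR)
    # and unvoiced frames (low energy, high ZCR), so both features vary
    # strongly over time; the threshold is a placeholder to be tuned on
    # labelled audio before any real use.
    energy, zcr = frame_features(signal, rate)
    return float(np.std(energy) * np.std(zcr)) > threshold
```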

2.2 Image analysis


Image analysis is based on digital image processing techniques in order to extract
meaningful information from digital images (Papadopoulos et al., 2009; Prats-Montalbán
et al., 2011). Image analysis tasks can be as complicated as identifying a person from
among many available faces. Image analysis techniques can generally be categorised
into two categories:
1 Object and face recognition: Object recognition is a long-established technique that
detects and recognises objects in digital images. Representative applications are logo
recognition and video OCR. Face recognition works on the face image: the face is
first detected in the image and then recognised.
2 Automatic image categorisation and search: Image categorisation organises
images into groups consistent with the images’ properties, such as texture, colour or
motion. Image search, on the other hand, provides the system with capabilities for
browsing, searching and retrieving images from a large database of digital images.
There are two kinds of image search:
a content-based image search
b annotation-based image search.
The former retrieves images based on their visual similarity to a user-supplied
query image or user-specified image features, avoiding the use of textual
descriptions; a minimal sketch of it is given after this paragraph. Annotation-based
image search exploits techniques for adding metadata such as keywords, captions, or
descriptions to the images, so that search can be performed over the annotation words.
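A minimal sketch of content-based image search, assuming colour histograms as the visual feature and histogram intersection as the similarity measure (common choices, not ones prescribed by the paper):

```python
import numpy as np

def colour_histogram(image, bins=8):
    # image: H x W x 3 uint8 array. Build a joint RGB histogram with
    # `bins` levels per channel, normalised to sum to one.
    quantised = (image.astype(int) // (256 // bins)).reshape(-1, 3)
    idx = quantised[:, 0] * bins * bins + quantised[:, 1] * bins + quantised[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def rank_by_similarity(query_img, database):
    # Rank database images by histogram intersection with the query
    # (1.0 = identical colour distribution, 0.0 = disjoint).
    q = colour_histogram(query_img)
    scores = {name: float(np.minimum(q, colour_histogram(img)).sum())
              for name, img in database.items()}
    return sorted(scores, key=scores.get, reverse=True)
```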

2.3 Video analysis


Video analysis techniques incorporated into e-learning applications can analyse video for
specific data, behaviour, objects or attitude. These techniques can also evaluate a
video’s content to extract specified information from it. Nowadays, various video
analysis techniques2 exist (Ekin et al., 2003):
• Shot boundary detection (SBD): This technique is used to split up a video sequence
into basic temporal units called shots. A shot is a series of interrelated consecutive
pictures/frames taken contiguously by a camera. The partitioning is achieved by
detecting the boundaries between neighbouring shots (Hanjalic, 2002; Joyce and Liu,
2006). TRECVid3 is a large-scale, worldwide benchmarking activity running
annually, whose goal is to encourage research into tasks related to content-based
information retrieval on digital video. It does this by providing a large video test
collection, uniform scoring procedures, and a forum for organisations interested in
comparing their results. Between 2001 and 2007, TRECVid supported evaluation of
the task of SBD where a large variety of SBD techniques from 57 different research
groups worldwide were benchmarked each year on the same video using the same
scoring mechanisms and with the same manually created ground truth. Smeaton
et al. (2010) present an overview of the TRECVid SBD task, examine its
achievements, and give a high-level overview of the most significant approaches
taken to SBD; a minimal sketch of a histogram-based detector appears after this list.
• High-level feature extraction: It detects the high-level semantic concepts and
features (e.g., speech) from video sequences. Such semantic concepts occur very
often in educational video lectures. For example, in a specific video experiment on
physics, the concept of a ball can be extracted.
• Event detection: This technique detects observations of events based on the event
definition. When detecting events, the temporal information is used. Educational
videos contain some information of events, such as ‘ball is falling’, ‘electrons are
turning around’, ‘teacher is solving a mathematical problem’, etc., which is often
significant information for learning purposes.
• Content-based copy detection: Educational videos can be modified. For example,
they can acquire a logo, colour transforms, black borders, reduced quality,
etc. A variety of transformations such as addition, deletion, and modification (of
aspect, colour, contrast, encoding, etc.) is feasible. Video copy detection identifies
copies from a video sample: it decides, using feature-based comparison, whether a
video segment belongs to an existing long video sequence. Thus, copyright
violations can be avoided. Detecting copies is imperative for copyright control,
business intelligence and advertisement tracking, law enforcement investigations, etc.
• Video search: Video search is more akin to browsing and finding interesting content
(viz. segments of video containing persons, objects, events, locations, etc. of
interest). For this reason, intelligent means are used to improve search accuracy
and efficiency. Writing algorithms that recognise what is inside a video will require
a complete redevelopment of search efforts.

• Summarisation: Video summarisation creates a summary of a video that contains
high-priority entities (expected highlights) and events from the original video. The
extraction principle depends on the expected highlights. For example, a highlight in
an educational video concerning mathematics can be the specific solution that is
given to a mathematical problem, etc. The video summary itself exhibits reasonable
degrees of continuity, and is free of repetition.
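To ground the first technique in the list, here is a minimal sketch of the classic histogram-difference approach to SBD; greyscale frames, 64 bins and the 0.4 cut threshold are simplifying assumptions that would need tuning against ground truth such as TRECVid's:

```python
import numpy as np

def detect_cuts(frames, bins=64, threshold=0.4):
    # frames: iterable of H x W greyscale arrays (uint8).
    # Declare a hard cut wherever the total-variation distance between
    # consecutive frame histograms exceeds `threshold`.
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()
        if prev_hist is not None:
            diff = 0.5 * np.abs(hist - prev_hist).sum()  # in [0, 1]
            if diff > threshold:
                cuts.append(i)  # boundary between frames i-1 and i
        prev_hist = hist
    return cuts
```

Gradual transitions (dissolves, wipes) need more than this pairwise test, which is one reason the TRECVid benchmark mattered.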

3 E-learning video services based on media analysis

There are various requirements concerning video consumption in e-learning. For example,
learners want to access certain learning content easily even though they cannot watch a
whole long video sequence due to limited time or bandwidth. On the other hand, service
providers want to know:
1 How to provide personalised services for different learners with variable interests
(Ardissono et al., 2004).
2 How to insert specific video notes (and perhaps advertisements) into various parts of
the video sequences without disturbing the user’s viewing.
Multimedia analysis techniques such as video structuring for adapted/personalised
services, video recommendation based on content similarity, and ad (or note)
recommendation based on media analysis, can be adopted in order to satisfy such
requirements.

3.1 Video structuring for adapted services


Video structuring comprises several techniques, including shot boundary
detection, story segmentation, key frame extraction, and even object recognition. It
partitions the video sequence into small units (Zhang, 2006). As shown in Figure 1,
the typical video structuring process aims to partition the video sequence V
into m shots (V = {S0, S1,…,Sm–1}) by shot boundary detection, to cluster the m shots into
n groups (V = {G0, G1,…,Gn–1}, m ≥ n) by story segmentation, and to extract k key frames
(F = {f0, f1,…,fk–1}, k ≥ m) from the shots. These structured results can be used to improve
e-learning video services. For example, based on key frames, users can browse an
educational video; based on story segmentation, learners can watch specific news stories
in a political science video; and based on scene and person detection, video browsing
can be personalised.
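To make the notation concrete, the following hypothetical sketch records the structuring output and checks the constraints m ≥ n and k ≥ m stated above:

```python
from dataclasses import dataclass

@dataclass
class StructuredVideo:
    shots: list       # S_0 .. S_{m-1}, from shot boundary detection
    groups: list      # G_0 .. G_{n-1}, from story segmentation
    key_frames: list  # f_0 .. f_{k-1}, extracted from the shots

    def check(self):
        m, n, k = len(self.shots), len(self.groups), len(self.key_frames)
        assert m >= n, "each story groups one or more shots (m >= n)"
        assert k >= m, "each shot yields at least one key frame (k >= m)"
```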

3.1.1 Video browsing based on key frames


Based on video analysis, the k key frames (F = {f0, f1,…,fk–1}, k ≥ m) can be extracted
from the video sequence V and used to represent it. In particular, the results of a video
search are often represented by feature images, e.g., one feature image per retrieved
video sequence. If more key frames are used, the storage cost for key frames increases.
Figure 2 shows a retrieved video of Britney Spears, which can be expanded into a
sequence of seven key frames. Thus, although the feature image contains no shot of
Britney singing, the key frames can convey the musical information about the song
‘Gimme More’. This method offers users suitable parts of the music video from which
to select the content of interest. The drawbacks of this process are the large storage cost
for key frames and the fact that it is still difficult to tell, from images alone, whether the
corresponding video is the expected one.

Figure 1 The video structuring process (see online version for colours)

Figure 2 Video browsing based on seven key frames – Britney Spears ‘Gimme More’ (see online
version for colours)

3.1.2 Learning cloud – based on story segmentation


Currently, a learning video is frequently composed of a number of stories separated by
anchorperson shots (Misra et al., 2010). Given that, multimedia analysis techniques can
partition the learning video V into n stories, V = {G0, G1,…,Gn–1}. These stories can be
consumed by users/learners in a personalised way over the internet. Importantly, these
stories can form a learning cloud similar to the concept introduced in Newscloud.
Newscloud (Iulio and Messina, 2008) is a system that automatically segments news
content into stories and provides search and navigation over it. As shown in Figure 3,
the news video is on the left of the webpage, while its stories are on the right. Each
story is indicated by an anchor shot, and users can watch the story of interest by
clicking the corresponding shot. This news consumption system can be extended to
e-learning video sequences.

Figure 3 Newscloud* based on story segmentation – designers visualise learning cloud!


(see online version for colours)

Source: *2424actu, available at http://www.2424actu.fr/actualite-la-une


(accessed on 15 November 2010)

3.1.3 Personalised video browsing based on scene and person detection


Multimedia analysis techniques can partition the video sequence into various scenes, and
key frames featuring various persons can be extracted. Consequently, the user can select
the scene that interests them as a starting point. In addition, the user can select a face of
interest to watch the related video clips. In many courses (e.g., a video course on
politics) it is beneficial to browse e-learning video by scene or person. The ‘interactive
transcript’ feature can be exploited for selecting scenes, persons, events, etc. Figure 4
shows a video lecture on politics, browsed by scene using an interactive transcript. This
method provides personalised video browsing according to personal interests.

Figure 4 A video lesson on politics – video browsing* based on scene using interactive transcript
(see online version for colours)

Source: *Voxalead News, available at http://voxaleadnews.labs.exalead.com/


(accessed on 15 November 2010)

3.2 Video recommendation based on content similarity


An interesting service that can be provided to lecturers and learners is video
recommendation. This service analyses the properties of video content and then
recommends other content with similar properties. Usually, the similarity between the
given video’s features A = {a0, a1,…,aq–1} and a candidate video’s features
B = {b0, b1,…,bq–1} is computed according to the following distance:

Ω(A, B) = (1/q) Σi=0…q–1 Dif(ai, bi)

where a feature may be a shot, story, key frame, etc., and Dif() is the distance function
(Ekin et al., 2003); a minimal sketch of this ranking appears after the list below. At the
moment, most of the proposed recommendation schemes are based on text analysis, and
two schemes are dominant:
1 annotation-based recommendation
2 scene-based recommendation.
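A minimal sketch of the distance-based ranking defined above, assuming scalar features and absolute difference as Dif() (both purely illustrative choices):

```python
def omega(a, b, dif):
    # Omega(A, B) = (1/q) * sum_i Dif(a_i, b_i): the mean per-feature
    # distance between the given video's features and a candidate's.
    assert len(a) == len(b)
    return sum(dif(x, y) for x, y in zip(a, b)) / len(a)

def recommend(query_features, candidates, dif, top=5):
    # Rank candidate videos by increasing distance to the query video.
    ranked = sorted(candidates,
                    key=lambda name: omega(query_features, candidates[name], dif))
    return ranked[:top]

# Hypothetical scalar features and absolute difference as Dif():
videos = {"lecture_a": [0.2, 0.9, 0.4], "lecture_b": [0.1, 0.8, 0.5]}
print(recommend([0.15, 0.85, 0.45], videos, dif=lambda x, y: abs(x - y)))
```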

3.2.1 Annotation-based recommendation


Annotation-based video recommendation takes into account annotated video
properties (e.g., the sequence’s title, classification, producer, production time, etc.) in
order to find other similar videos. By comparing the annotated properties, the
recommendation process tries to find similar video sequences. This type of
recommendation is fast, but its accuracy depends on the detail of the video annotation,
which must be made beforehand. Figure 5 shows such a video recommendation scheme
combined with video search. Here, on the left side of the webpage, we can find videos
similar to the retrieved one by additionally selecting: length {short, medium, long},
screen type {standard, widescreen}, resolution {low, medium, high}, and source
{msn, aol,…, youtube, etc.}.

Figure 5 Video recommendation based on content similarity using the search engine Bing*
(the Parthenon is the interested video) (see online version for colours)

Source: *Bing, available at http://cn.bing.com/


(accessed on 15 November 2010)

3.2.2 Scene-based recommendation


In scene-based recommendation, the video sequence is segmented into scenes, and a
specific feature is extracted from each scene. Scenes from other video sequences are
then compared with the considered one, i.e., the scenes’ features are compared to
determine their similarity. Moreover, based on scene similarity, the similarity of two
video sequences can also be determined, which identifies similar video sequences.
It is noteworthy that the advanced video search engine Blinkx4 provides two
levels of recommendation:
1 scene recommendation (for every scene, similar scenes are listed)
2 video recommendation (for a certain video sequence, similar videos are listed).

3.3 Video note recommendation based on media analysis


Despite the success of using video in education, one challenge is how to effectively
support video annotation. Video annotation is different from making notes on text-based
materials such as papers or digital text files (Smith et al., 2005). Without special
software, people cannot add notes on video margins, highlight a video shot, underline a
video frame, or mark a scene with an asterisk. Oftentimes users record their video notes
on an additional medium such as paper. This makes later use of video notes difficult
and prone to error. Users have to recall and then find the associated video segment in a
video player to rebuild the context when the notes are too ambiguous or depend on
the video content. For example, a comment on a dance video, ‘foot position is too high’,
needs to be presented along with the related video segment to convey a full understanding of
its meaning (Cherry et al., 2003). Video notes may appear on the e-learning service
webpage and affect learners’ experiences. A video note, i.e., a comment on a learning
video such as ‘the velocity of the sphere is too high’, is very useful when presented
along with the related video segment. We view a video note as similar to an
advertisement, since inserting ads in a video sequence is a technologically
well-established process (although the idea of introducing advertisements5 into
e-learning materials is considered inappropriate in most academic settings). It is
important to limit how frequently video notes appear and to place them at the right
points of the video sequences; this improves the experience of interested users.
Multimedia analysis can provide a suitable solution to this problem: detect a user’s
interests by analysing the video
content they view, and then pushing the corresponding video notes to the user. For
example, from the p videos of interest (A0, A1,…,Ap–1), the user’s interest feature is
detected as

A = {K(a0,0,…, a0,p–1),…, K(aq–1,0,…, aq–1,p–1)}

where ai,j denotes the ith feature of video Aj.
Here, K() may be a combination operation (Ekin et al., 2003; Chen and Jang, 2009),
e.g., the mean; a minimal sketch follows. There exist two kinds of recommendation
methods based on media analysis: video-based note recommendation and scene-based
note recommendation.
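A minimal sketch, assuming K() is the element-wise mean over the p feature vectors and that notes are matched to the resulting profile by mean absolute difference (both assumptions for illustration only):

```python
import numpy as np

def interest_profile(video_features):
    # video_features: p x q array; row j is the q-dimensional feature
    # vector of interested video A_j. With K() = mean, entry i of the
    # profile is K(a_{i,0}, ..., a_{i,p-1}).
    return np.asarray(video_features, dtype=float).mean(axis=0)

def pick_note(profile, notes):
    # Push the video note whose feature vector lies closest to the
    # user's interest profile (mean absolute difference as distance).
    return min(notes,
               key=lambda n: float(np.abs(profile - np.asarray(notes[n])).mean()))
```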

3.3.1 Video-based note recommendation vs. scene-based note recommendation


While the user is watching learning video content, it would be beneficial to see a
recommended video note that is closely related to that content. The relation may be
based on the video’s title, topic or key persons. Usually, the recommended video note is
played in the top-right corner of the page, while some other related video notes are
listed at the bottom-right of the page. Video-based note recommendation can be
implemented either through video annotation made beforehand or through real-time
content-based analysis.
The scenes in a video sequence can also be used to push video notes: a typical scene
is recognised, and the corresponding video note is inserted. We can insert video notes
into a video sequence at certain scenes in order to inform users in more detail. In
YouTube6, video notes can be inserted in various scenes of a video sequence. Imagine a
specific scene is detected in the video sequence and the analogous video note is inserted
in the corresponding scene. The video note is then indicated at the corresponding
position, and it is played in a semi-transparent manner when the user clicks the video
note banner. This method can be attractive to service providers if the accuracy of scene
detection can be confirmed.

4 Learner experiences based on media analysis

Learners can enjoy user-friendly browsing or interaction methods for internet videos by
using tools that incorporate media analysis techniques. Characteristic examples
include Videosphere, which is based on media analysis, and interactive search based on
content similarity.

4.1 Videosphere based on media analysis


Videosphere7 provides a video browser that can visualise the relations between
multiple video sequences. As shown in Figure 6, all the videos can be combined into a
sphere, named Videosphere. Each video sequence is related to the neighbouring ones.
The relations can be defined along different aspects, e.g., the video’s title, a known
person in the video, the video’s topics, the scene in the video, etc. Users can watch a
certain video by clicking it, or investigate the relations by zooming in/out or rotating the
sphere. These properties make Videosphere very functional for browsing a video database.

Figure 6 Videosphere (see online version for colours)



4.2 Interactive search based on content similarity


When users search for videos on the internet, the retrieved results are typically listed in
a linear manner. However, this is not appropriate for video content browsing. Instead,
the retrieved results can be listed according to their similarity to the requested content.
For example, if you click on one video, this video will be brought to the front, and the
other videos will be re-listed according to their similarity to the clicked one. This
method provides a friendly interface for video search, and also an interactive scheme for
refining the search results; a minimal sketch of the re-ranking step follows.
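A minimal sketch of that click-driven re-ranking; the similarity function is assumed to come from whatever content analysis is available:

```python
def rerank_on_click(results, clicked, similarity):
    # Bring the clicked video to the front and re-list the remaining
    # results by decreasing similarity to it.
    rest = [v for v in results if v != clicked]
    rest.sort(key=lambda v: similarity(clicked, v), reverse=True)
    return [clicked] + rest
```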

4.3 Using tools for video annotation and navigation


Many efforts have been made to provide tools for annotation and navigation in various
types of videos in other domains. For example, in the sports video domain, Alastruey
(2009) presented the LongoMatch, while Beetz et al. (2009) proposed the ASPOGAMO
tool. Both these tools are used for analysing and annotating sports videos. In the
documentary domain, Kanellopoulos (in press) proposes a system for the semantic
annotation and retrieval of audio-visual media objects concerning documentaries. This
system uses a manual annotation tool, an authoring tool and a search engine for
documentary experts. To illustrate how specific parts of a video sequence can be
retrieved in this system, assume that Figure 7 depicts the layered representation of a
100-frame shot containing three overlapping actions. Suppose the query: ‘find a piece of
a natural history documentary from Africa, where the documentarist is speaking and
touching a gorilla, while the gorilla is eating a banana’. This query can be answered
easily by isolating the common parts of the shot, as depicted in the shaded portion of
Figure 7.

Figure 7 Layered annotation of actions and the isolated segment of a shot for a query (see online
version for colours)
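The retrieval in this example amounts to intersecting the frame intervals of the annotated action layers; a hypothetical Python sketch with made-up frame ranges for the 100-frame shot:

```python
def intersect_intervals(intervals):
    # Each annotation layer contributes a (start, end) frame interval;
    # the answer to the query is the range common to all layers.
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start <= end else None

# Hypothetical frame ranges for the three overlapping actions:
layers = {"documentarist_speaking": (5, 80),
          "documentarist_touching_gorilla": (30, 70),
          "gorilla_eating_banana": (40, 95)}
print(intersect_intervals(list(layers.values())))  # -> (40, 70)
```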

Existing video tools, such as 5min Live Videopedia, Quick.tv, Viddix Beta, VideoAnt,
VideoClix, and YouTube Video Annotations, lack several features needed for interactive
videos. These tools offer functions for integrating additional information in a video, but
they do not support non-linearity in the video flow. On the other hand, many authoring
tools do support non-linearity in the video flow, but they are complex. For example,
high-end software like Adobe Creative Suite 5 (Adobe, 2010), Adobe Flash CS4
Professional (Adobe Flash, 2010) and Microsoft Expression Studio 3 (Microsoft
Corporation, 2010) allows professionals and web programmers to create pleasant
interactive videos, but it is too complex for non-experts to use, because of the large set
of features it provides.
Focusing on the e-learning domain, Aubert and Prie (2007) presented the Advene tool,
which allows navigation via self-defined annotations. Rehatschek and Kienast (2001)
presented the Vizard, while Smith and Lugeon (2000) presented the VideoAnnEx
annotation tool. Both tools provide algorithms for shot detection and annotation of the
shots. Bloehdorn et al. (2005) presented the tool M-OntoMat-Annotizer that supports
multimedia analysis, reasoning and retrieval. Renzel et al. (2009) presented the
Virtual Campfire, a tool that provides a lot of interactive features and a graph view for
non-linearity, but is designed for collaborative multimedia semantisation with mobile
social software. Bhikharie (2010) presented the eXtensible Interactive Media Player for
Entertainment and Learning (XIMPEL) that provides non-linearity in the video flow, but
no additional information can be added. The annotation in most of the above tools is
based on the MPEG-7 standard.
Bargeron et al. (1999) developed a collaborative video annotation system, MRAS,
and compared it with handwritten note-taking during video viewings. MRAS was able to
contextualise the notes as it recorded the timestamp at each point in the video when notes
were written. While a timestamp does provide context for notes, it still requires user
effort to manually pinpoint the video control to the exact time. Mu (2010) developed a
new system called interactive shared education environment (ISEE) to facilitate
individual note-taking and collaborative video learning. Synchronisation between the
notes and their video context was supported using hyperlinked timestamps, which he
called Smartlink. His study tested the effect of this automatic synchronisation
function and explored how users make notes on video. With the goal of encouraging
student-teachers to reflect on their teaching performance, Kong (2010) developed a
web-enabled video system to permit them to record their classroom performance and then
retrieve online videos of their teaching for self-reflection. His study aimed to evaluate the
effectiveness of online videos in facilitating self-reflection amongst student-teachers.
Dong et al. (2010) introduced an ontology-driven framework for the annotation and
access of educational video data. In their framework, the ontology-driven annotation is a
two-step process:
1 video segmentation
2 video annotation data extraction and organisation.
In video segmentation, they propose and utilise multi-mode segmentation procedures for
the presented videos. To extract annotation data from videos and organise them in a way
that facilitates video access, they employed a multimedia annotation model that is based
on multiple ontologies.

5 Multimedia-based assignments for learning with video annotations

There are many benefits to using video in education (Mu, 2010). Video supports
playback that allows repeated watching for memory reinforcement or concept
clarification. Video-based materials can appear in the curricula of various sciences. When
video is introduced into a curriculum, learning activities generate a profound level of
engagement, better understanding of the content, or even an improvement in students’
cognitive capacities for learning from video. A user study on the effectiveness of video
demonstrated that using video in the learning process received very positive reactions
and that web-cast video improved students’ interactivity (Reynolds and Mason, 2002).
Caspi et al. (2005) further observed that new comprehension strategies were applied after
video was introduced. The introduction of video requires the completion of video
annotation. The richer the annotation data are, the more flexible the video access
becomes, and thus the more effectively the video data can be utilised. Offline video editing
environments can support video annotation through the direct manipulation of the source
video material. In addition, web-based video streaming services increasingly support
direct referencing, embedding, and sharing of sub-selections of video. Standards
committees are close to finalising so-called ‘time-based’, ‘isochronic’ or ‘fine-grained’
metadata for web-based video resources. However, these standards do not specify the
design of user interfaces, workflows, and pedagogies leveraging these standards.
Actually, the design of the student experience around video may continue largely
independently of the finalisation of these standards, so the iterative design of these
analysis environments ought to inform their completion.
When the videos being studied are hosted on YouTube, students can be instructed to
utilise services like Splicd or TubeChop to clip specific selections and then to compose
their responses, incorporating their selections, in any multimedia authoring environment,
such as their learning management system, a course blog, or a Wiki. Students can be
assigned to watch and evaluate scenarios in order to improve their skills of observation,
interpretation, reasoning and judgment. Moreover, as production costs continue to fall,
we are also beginning to see more self-reflective learning activities in which students
capture their own original video for subsequent analysis. Video can be treated as a
manipulative object, as raw material to be controlled, segmented, re-organised, discussed
and debated as part of an active learning experience, and the instructor or students can
develop the narrative. According to Bossewitch and Preston (2012), there are six main
multimedia-based assignments for learning with video annotations:
• “Guided lessons: Instructors can pre-select video clips and organise them into a
specific sequence to be viewed by students, who must answer questions associated
with each video segment.
• Lecture comprehension: Students are assigned to view a recorded interview or
lecture and then select three segments and comment on them. Students are then
instructed that comments should be in their own words and to avoid repeating the
words of the source. The first comment should be one that they think is a novel
notion. The second should be something they do not understand, a difficult idea, or
something they want to understand better. The third is a segment that they think is
related to the current classroom dialogue.
• Close object analysis with targeted comparisons: Students can work with a curated
collection of multimedia learning objects, and select two objects to closely compare
and contrast. They can work individually to write comparison essays, embedding
specific annotations from within the object to illustrate and support their claims.
Next, students can be asked to study the comparison projects of other students in the
class.

• Communal hunting and gathering, with in-class synthesis: Students are introduced to
a curated collection of sources, but are also encouraged to explore pertinent cultural
representations available on the open web. In this environment, the annotations that
students create are shared across the class, and an explicit learning objective is the
transference of a ‘judicial selection’ of source material from faculty to students.
Students gather objects and then compare their selections during in-class discussion.
Finally, students compose final projects that incorporate these annotations.
• Collective analysis across semesters of a core set of resources: Students explore an
archive of a serialised work, such as a digitised newspaper, to investigate patterns
that emerge over time but might not otherwise be detected by the typical consumer
of the source material, and whose focus might be less critical or not longitudinal.
These findings are collected and shared in a class investigation of a particular
resource.
• Reflection on self-evaluations/performances: Students can videotape their own
performances as pre-service teachers, therapists, doctors, etc., and then can write an
analysis, self-critique or reflection, embedding clips from their performances to
illustrate points raised, according to criteria established by the instructor. Students
learn to recognise successful and unsuccessful behaviours they can correct and to
utilise self-reflection as a tool for ongoing improvement as a professional.”

6 Conclusions and future research directions

This paper provides lecturers and learners with a state-of-the-art overview of multimedia
analysis techniques used mainly in internet video for e-learning purposes. The paper
presents also some open issues for further research in the area of video analysis for
e-learning applications. Even though multimedia analysis is not yet mature, some simple
techniques have already been used for internet video services. It is expected that more
multimedia analysis techniques will improve and that internet video services focusing
on e-learning will be enriched. However, Lian (2011) states:
“Multimedia analysis is not practical for real-time applications in internet video
for two main reasons: (1) reliability of multimedia analysis; (2) speed of
multimedia analysis.”
Some multimedia analysis techniques, such as shot boundary detection and news
video segmentation, are simple and suitable for practical applications because their
reliability is satisfactory. However, given the diversity of video content, it is still
difficult to guarantee high accuracy in content segmentation, recognition or search
(Joyce and Liu, 2006). Video analysis frequently has a high computational cost because
of the large data volumes of video content (Over et al., 2008). This in turn may limit the
application of video analysis in real-time scenarios. A potential solution is to annotate
some videos beforehand; for live videos, however, the analysis method itself needs to be
optimised.
There are several areas of investigation that may be explored in the future. We have
identified some of them as follows:

• The research community must consider how to provide personalised web-based
video services for different learners with varied interests.
• Potential research should focus on the design and the development of applications
that will be based on the notion of ‘learning cloud’ which is based on story
segmentation.
• Future research must consider whether ad (and note) recommendation in e-learning
video applications is practical. In particular, video-based ad (or note)
recommendation versus scene-based ad (or note) recommendation must be examined.
• Even though many authoring tools for video editing support non-linearity in the
video flow, they are too complex for non-experts (lecturers) to use because of the
large set of features they provide. Therefore, authoring tools for video annotation
and navigation that are simple for non-experts to use should be designed, developed
and evaluated for e-learning applications.

References
Adobe (2010) Adobe Creative Suite 5 Master Collection, April 2010, available at http://www.
adobe.com/products/creativesuite/mastercollection/ (accessed on 15 November 2010).
Adobe Flash (2010) Adobe Systems, Adobe Flash CS4 Professional Website, April 2010,
available at http://www.adobe.com/de/products/flash/features/?view=topoverall (accessed on
15 November 2010).
Alastruey, A.M. (2009) LongoMatch: The Digital Coach, April 2009, available at
http://longomatch.ylatuya.es/documentation/manual.html (accessed on 15 November 2010).
Ardissono, L., Kobsa, A. and Maybury, M.T. (Eds.) (2004) Book Series: Human-computer
Interaction Series: Vol. 6. Personalized Digital Television: Targeting Programs to Individual
Viewers, 331pp, Springer, Berlin.
Atrey, P.K., Hossain, M.A., El Saddik, A. and Kankanhalli, M.S. (2010) ‘Multimodal fusion for
multimedia analysis: a survey’, Multimedia Systems, Vol. 16, No. 6, pp.345–379.
Aubert, O. and Prie, Y. (2007) ‘Advene: an open-source framework for integrating and visualising
audiovisual metadata’, MULTIMEDIA ’07: Proc. of the 15th International Conference on
Multimedia, ACM, New York, NY, USA, pp.1005–1008.
Bargeron, D., Gupta, A., Grudin, J. and Sanocki, E. (1999) ‘Annotations for streaming video on the
web: system design and usage studies’, CHI ’99 Conference on Human Factors in Computing
Systems, pp.61–75.
Beetz, M., von Hoyningen-Huene, N., Kirchlechner, B., Gedikli, S., Siles, F., Durus, M. and
Lames, M. (2009) ‘ASpoGAMo: automated sports game analysis models’, International
Journal of Computer Science in Sport, Vol. 8, No. 1.
Bhikharie, W. (2010) XIMPEL Overview, available at http://www.cs.vu.nl/~eliens/im/report-im08-
ximpel.pdf (accessed on 15 November 2010).
Bloehdorn, S., Petridis, K., Saathoff, C., Simou, N., Tzouvaras, V., Avrithis, Y., Handschuh, S.,
Kompatsiaris, Y., Staab, S. and Strintzis, M.G. (2005) ‘Semantic annotation of images and
videos for multimedia analysis’,
Proc. of the 2nd European Semantic Web Conference, ESWC 2005, June 2005, Heraklion,
Crete, Greece, pp.592–607.
Bossewitch, J. and Preston, M. (2012) ‘Teaching and learning with video annotations’, available at
http://learningthroughdigitalmedia.net/teaching-and-learning-with-video-annotations (accessed
on 12 January 2012).

Caspi, A., Gorsky, P. and Privman, M. (2005) ‘Viewing comprehension, students’ learning
preferences and strategies when studying from video’, Instructional Science, Vol. 33, No. 1,
pp.31–47.
Chen, Z-S. and Jang, J-S. (2009) ‘On the use of anti-word models for audio music annotation and
retrieval’, IEEE Transactions on Audio, Speech and Language Processing, Vol. 17, No. 8,
pp.1547–1556.
Cherry, G., Fournier, J. and Stevens, R. (2003) ‘Using a digital video annotation tool to teach dance
composition’, Interactive Multimedia Electronic Journal of Computer-enhanced Learning,
Vol. 5, No. 1, available at http://imej.wfu.edu/articles/2003/1/01/index.asp (accessed on
15 November 2010).
Corporation for Public Broadcasting (2004) ‘Television goes to school: the impact of video on
student learning in formal education’, Corporation for Public Broadcasting, Washington, DC,
available at http://www.cpb.org/stations/reports/tvgoestoschool/tvgoestoschool.pdf (accessed
on 02/03/2011).
Dang, H.T. and Owczarzak, K. (2008) ‘Overview of the TAC 2008 update summarization task’,
2008 Text Analysis Conference (TAC 2008), 17–19 November 2008, Gaithersburg, Maryland
USA.
Dong, A., Li, H. and Wang, B. (2010) ‘Ontology-driven annotation and access of educational video
data in e-learning’, in Soomro, S. (Ed.): E-learning Experiences and Future, ISBN: 978-953-
307-092-6, InTech, available at http://www.intechopen.com/books/e-learning-experiences-
and-future/ontology-driven-annotation-and-access-of-educational-video-data-in-e-learning
(accessed on 10 December 2011).
Ekin, A., Tekalp, A.M. and Mehrotra, R. (2003) ‘Automatic soccer video analysis and
summarization’, IEEE Transactions on Image Processing, Vol. 12, No. 7, pp.796–807.
Hanjalic, A. (2002) ‘Shot-boundary detection: unraveled and resolved?’, IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 12, No. 2, pp.90–105.
Iulio, M.D. and Messina, A. (2008) ‘Use of probabilistic clusters supports for broadcast news
segmentation’, Proceeding of 19th International Conference on Database and Expert Systems
Application, pp.600–604.
Joyce, R. and Liu, B. (2006) ‘Temporal segmentation of video using frame and histogram space’,
IEEE Transactions on Multimedia, Vol. 8, No. 1, pp.130–140.
Kanellopoulos, D. (in press) ‘Semantic annotation and retrieval of documentary media objects’,
The Electronic Library, Emerald.
Kim, S. and Yoon, Y. (2010) ‘Synchronization e-learning model for harmonizing presentation’,
9th IEEE/ACIS International Conference on Computer and Information Science, pp.451–456.
Kong, S.C. (2010) ‘Using a web-enabled video system to support student–teachers’ self-reflection
in teaching practice’, Computers & Education, Vol. 55, No. 4, pp.1772–1782.
Lian, S. (2011) ‘Innovative internet video consuming based on media analysis techniques’,
Electronic Commerce Research, Vol. 11, No. 1, pp.75–89.
Meixner, B. et al. (2010) ‘SIVA suite: authoring system and player for interactive non-linear
videos’, MM’10, 25–29 October 2010, Firenze, Italy, pp.1563–1566.
Merkt, M., Weigand, S., Heier, A. and Schwan, S. (2011) ‘Learning with videos vs. learning with
print: the role of interactive features’, Learning and Instruction, Vol. 21, No. 6, pp.687–704.
Microsoft Corporation (2010) Microsoft Expression Studio 3 website, April 2010, available at
http://www.microsoft.com/germany/Expression/ (accessed on 10 November 2011).
Misra, H., Hopfgartner, F., Goyal, A., Punitha, P. and Jose, J. (2010) ‘TV news story segmentation
based on semantic coherence and content similarity’, Advances in Multimedia Modeling,
LNCS, Vol. 5916/2010, pp.347–357, DOI: 10.1007/978-3-642-11301-7_36.
Mu, X. (2010) ‘Towards effective video annotation: an approach to automatically link notes with
video content’, Computers & Education, Vol. 55, pp.1752–1763.

Over, P., Awad, G., Rose, T., Fiscus, J., Kraaij, W. and Smeaton, A.F. (2008) ‘TRECVID 2008 –
goals, tasks, data, evaluation mechanisms and metrics’, ACM Multimedia 2008, October,
Vancouver, Canada.
Papadopoulos, G.T., Saathoff, C., Grzegorzek, M., Mezaris, V., Kompatsiaris, I., Staab, S. and
Strintzis, M.G. (2009) ‘Comparative evaluation of spatial context techniques for semantic
image analysis’, 10th Workshop on Image Analysis for Multimedia Interactive Services
(WIAMIS’09), May 2009, pp.161–164.
Pauli, C., Reusser, K. and Grob, U. (2007) ‘Teaching for understanding and/or self-regulated
learning: a video-based analysis of reform-oriented mathematics instruction in Switzerland’,
International Journal of Educational Research, Vol. 46, No. 5, pp.294–305.
Prats-Montalbán, J.M., de Juan, A. and Ferrer, A. (2011) ‘Multivariate image analysis: a review
with applications’, Chemometrics and Intelligent Laboratory Systems, Vol. 107, No. 1,
pp.1–23.
Rehatschek, H. and Kienast, G. (2001) ‘VIZARD – an innovative tool for video navigation,
retrieval, annotation and editing’, Proceedings of the 23rd workshop of PVA ‘Multimedia and
Middleware’, May 2001, available at http://www.video-wizard.com/index-n.htm (accessed on
10 November 2011).
Renzel, D., Klamma, R., Cao, Y. and Kovachev, D. (2009) ‘Virtual campfire – collaborative
multimedia semantization with mobile social software’, Proc. of the 10th International
Workshop of the Multimedia Metadata Community on Semantic Multimedia Database
Technologies (SeMuDaTe’09), December 2009, Graz, Austria.
Repp, S. and Meinel, C. (2006) ‘Semantic indexing for recorded educational lecture videos’, 4th
Annual IEEE Int. Conference on Pervasive Computing and Communications Workshops
(PERCOMW’06).
Reynolds, P. and Mason, R. (2002) ‘On-line video media for continuing professional development
in dentistry’, Computers & Education, Vol. 39, No. 1, pp.65–98.
Smeaton, A., Over, P. and Doherty, A. (2010) ‘Video shot boundary detection: seven years of
TRECVid activity’, Computer Vision and Image Understanding, Vol. 114, No. 4, pp.411–418.
Smith, B.K., Blankinship, E. and Lackner, T. (2005) ‘Annotation and education’, IEEE Multimedia,
Vol. 7, No. 2, pp.84–89.
Smith, J.R. and Lugeon, B. (2000) ‘Visual annotation tool for multimedia content description’,
Proc. SPIE, Vol. 4210, pp.49–59.
Sorensen, C. and Baylen, D.M. (1999) ‘Interaction in interactive television instruction: perception
versus reality’, Proceedings of the Annual Meeting of the American Educational Research
Association, Montreal, Quebec, Canada.
Teese, R. (2007) ‘Video analysis – a multimedia tool for homework and class assignments’, 12th
International Conference on Multimedia in Physics Teaching and Learning, 13–15 September
2007, Wroclaw, Poland.
Zhang, D., Zhao, J.L., Zhou, L. and Nunamaker, J. (2004) ‘Can e-learning replace traditional
classroom learning – evidence and implication of the evolving e-learning technology’,
Communications of the ACM, Vol. 47, No. 5, pp.75–79.
Zhang, D., Zhou, L., Briggs, R. and Nunamaker, J. (2006) ‘Instructional video in e-learning:
assessing the impact of interactive video on learning effectiveness’, Information &
Management, Vol. 43, No. 1, pp.15–27.
Zhang, Y-J. (2006) Advances in Image and Video Segmentation, Idea Group Pub., Hershey.

Notes
1 Multimedia Understanding through Semantics, Computation and Learning (MUSCLE),
available at http://muscle.ercim.org/ (accessed on 15 November 2010).
2 TREC Video Retrieval Evaluation, available at http://www-nlpir.nist.gov/projects/trecvid/
(accessed on 15 November 2010).
3 Ibid.
4 Blinkx, available at http://www.blinkx.com/ (accessed on 15 November 2010).
5 Leexoo video search, available at http://adgrid.leexoo.com/ (accessed on 15 November 2010).
6 Youtube, available at http://www.youtube.com/ (accessed on 15 November 2010).
7 Videosphere, available at http://www.netexplorateur.org/videospheres/english/ (accessed on
15 November 2010).
