What is Parallelism?
Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do parts of the work at the same time.
Do you think it will create the problem of non-standardized attributes, if one source uses
0/1 and second source uses 1/0 to store male/female attribute respectively? Give a reason to
support your answer. 2 marks
Yes, it will create the problem of non-standardized attributes, because the gender attribute is stored inconsistently across the two sources: the same code means opposite genders depending on which source the record came from.
Two ways in which parallelism can reduce system performance. 2 marks
Parallelism can reduce system performance on over-utilized systems or systems with
small I/O bandwidth
There are two primary techniques for gathering requirements i.e. interviews or facilitated
sessions. Kimball prefers using which one? 2 marks.
There are two primary techniques for gathering requirements, i.e. interviews or facilitated sessions. Both have advantages and disadvantages. Interviews encourage a lot of individual participation. They are also easier to schedule. Facilitated sessions may reduce the elapsed time to gather requirements, although they require more time commitment from each participant. Kimball prefers using a hybrid approach with interviews to gather the gory details and then facilitation to bring the group to consensus.
Why is the Analytic Track considered the fun part? 2 marks.
Write any three complete warehouse deliverables. 3 marks
Data
Analytic applications
Data access tools
Education tools
This query was given: SELECT * FROM R WHERE A = 5, and we have to tell which
technique is appropriate from dense, sparse, B-tree and hash indexing. 5 marks
B-tree indexes are the most common index type used in typical OLTP applications and
provide excellent levels of functionality and performance. Used in both OLTP and data
warehouse applications, they speed access to table data when users execute queries with
varying criteria, such as equality conditions and range conditions. B-tree indexes improve
the performance of queries that select a small percentage of rows from a table.
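Since the answer hinges on equality lookups, a quick SQLite sketch can illustrate it (the table name R and column A come from the question; the second column, the data, and the index name are illustrative). SQLite implements CREATE INDEX as a B-tree:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (A INTEGER, B TEXT)")
con.executemany("INSERT INTO R VALUES (?, ?)",
                [(i % 10, f"row{i}") for i in range(100)])

# SQLite builds a B-tree for CREATE INDEX; equality and range
# predicates on A can now be answered by walking the tree.
con.execute("CREATE INDEX idx_r_a ON R(A)")

rows = con.execute("SELECT * FROM R WHERE A = 5").fetchall()
plan = con.execute("EXPLAIN QUERY PLAN SELECT * FROM R WHERE A = 5").fetchall()
print(len(rows))   # 10 matching rows
print(plan)        # the plan mentions idx_r_a, i.e. the B-tree index is used
```

The query selects only a small percentage of rows (10 of 100 here), which is exactly the case where the answer says a B-tree index pays off.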
Why should companies entertain students to visit their premises? 5 marks
Exception table
EmpName    IsAgeValid    Age
Ali        1             28
Faisal     1             32
Waseem     1             389
Arham      1             398
We have to write a query to access employee table and set the value of IsAgeValid =0 where
age is greater than and equal to 25 and less than and equal to 75. 5 marks
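The expected answer is a single UPDATE statement; a sketch using SQLite (the EmpName, IsAgeValid and Age columns are from the exception table above; the table name Employee follows the question, and the condition is written exactly as the question states it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employee (EmpName TEXT, IsAgeValid INTEGER, Age INTEGER)")
con.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                [("Ali", 1, 28), ("Faisal", 1, 32),
                 ("Waseem", 1, 389), ("Arham", 1, 398)])

# Set IsAgeValid = 0 where 25 <= Age <= 75, per the question's wording
con.execute("UPDATE Employee SET IsAgeValid = 0 WHERE Age >= 25 AND Age <= 75")

flags = dict(con.execute("SELECT EmpName, IsAgeValid FROM Employee"))
print(flags)  # Ali and Faisal get 0; Waseem and Arham keep 1
```

Note that, taken literally, this flags the plausible ages; if the intent was to flag out-of-range ages like 389 and 398, the condition would be negated, but the query above follows the question as written.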
Pest scouting
Pest scouting is a systematic field sampling process that provides field-specific
information on pest pressure and crop injury.
Conventional indexes
Basic Types:
Sparse
Dense
Multi-level (or B-Tree)
4 types of partitioning
Hash partitioning
Key range partitioning
List partitioning
Round robin
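Of the four, hash partitioning is the easiest to sketch in a few lines of Python (the function and field names are mine, not from the handout): each row is routed to a partition by hashing its partitioning key.

```python
def hash_partition(rows, key, n_parts):
    """Assign each row to one of n_parts partitions by hashing its key column."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

rows = [{"id": i} for i in range(6)]
parts = hash_partition(rows, "id", 3)
print([len(p) for p in parts])  # rows spread across the 3 partitions
```

Key-range partitioning would instead compare the key against boundary values, list partitioning would look the key up in explicit per-partition lists, and round-robin would deal rows out cyclically regardless of the key.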
1. It is called a _____________ violation, if we have null values for attributes where NOT NULL constraint
exists
Load
Transform
Slackly
Lethargically
1. The application development quality assurance activities cannot be completed until the data is
_____________
Finalized
Computerized
Composing
Decomposing page 41
Joining/merging
Combining
Lexical errors
2.
3.
4.
Irregularities
5.
Duplication
i and ii and v
i and ii
Additive
Non-additive
Association
Non-association
1. Giving the least time to ____ can prove suicidal for a DWH project
OLAP
De-normalization
K is equal to square of n
1. The ith bit is set to 1 if the ith row of the base table has the value for the indexed column. The
statement refers to
Inverted
Sparse index
1. __________ is a systematic sampling process that provides field specific information on pest
pressure and crop injury.
Seed survey
Water survey
1. In the context of a Web data warehouse, which is NOT one of the ways to identify a session?
Some MCQs from my midterm paper. 2 underlined MCQs are also included in my
final paper.
1. The telecommunication data warehouse is dominated by the sheer volume of data generated at
the call level _________ area.
Subject page 35
Object
Aggregate
Details
1. 3NF removes even more data redundancy than 2NF, but it is at the cost of
No. of tables
Relations
1. In full extraction, data is extracted completely from the source. There is no need to keep track of changes to the
_________
Data mart
Data destination
Ad-hoc access
Complete repository
Historical data
Volatile page 27
1. Experience showed that for a single pass of magnetic tape that scanned 100% of the records,
only ________ of the records were used.
5% page 12
30%
50%
80%
1. HOLAP provides a combination of relational database access and cube data structures within a
single framework. The goal is to get the best of both MOLAP and ROLAP:
scalability and high performance page 78
1. ____________ are created out of the data warehouse to service the needs of different
departments such as marketing, sales etc.
MIS
OLAPs
Answer:
1. Bitmap index: there was a run-length encoding question; the input was given and the output had to be found. Page
no. 234
Answer: If we apply Run length Encoding on the input 11001100, the output will be
12#02#12#02
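The value#count format used in these answers can be reproduced with a short Python sketch (the function name is mine): each run of identical bits is emitted as the bit value followed by the run length, with runs joined by '#'.

```python
def rle_encode(bits):
    """Run-length encode a bit string as value+count pairs joined by '#'."""
    out = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1                       # extend the current run
        out.append(f"{bits[i]}{j - i}")  # bit value followed by run length
        i = j
    return "#".join(out)

print(rle_encode("11001100"))  # 12#02#12#02
```

The same function reproduces the other worked cases in these notes: "1111111110000111" encodes to 19#04#13 and "00001111000000" to 04#14#06.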
In such cases we can formulate a mechanism to correct gender. We can either use a standard
gender guide or create a new table, Gender_guide. Gender_guide contains only two
columns, name and gender. Populate the Gender_guide table by a query selecting all
distinct first names from the student table, then manually place their gender.
This table can serve as a guide telling us what the gender against a particular
name can be. For example, if we have a hundred students in our database with the first name
Muhammad, then in our Gender_guide table we will have just one entry, Muhammad,
and we will manually set the gender as Male against it. Now, to fill the missing
genders in the exception table, we just do an inner join on the Error table and the Gender_guide
table.
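The join described above can be sketched in SQLite (the table and column names Error_table, FirstName, Gender_guide are illustrative, as the handout does not fix them; SQLite's UPDATE uses a correlated subquery rather than a literal join clause, but the effect is the described inner join on first name):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Error_table (FirstName TEXT, Gender TEXT)")
con.execute("CREATE TABLE Gender_guide (Name TEXT, Gender TEXT)")
con.executemany("INSERT INTO Error_table VALUES (?, ?)",
                [("Muhammad", None), ("Ayesha", None)])
con.executemany("INSERT INTO Gender_guide VALUES (?, ?)",
                [("Muhammad", "Male"), ("Ayesha", "Female")])

# Fill each missing gender by matching the first name against the guide table
con.execute("""
    UPDATE Error_table
    SET Gender = (SELECT g.Gender FROM Gender_guide g
                  WHERE g.Name = Error_table.FirstName)
    WHERE Gender IS NULL
""")
print(con.execute("SELECT * FROM Error_table").fetchall())
```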
There were two ad-hoc run-length encoding inputs and the output had to be given.
Run length used in bitmap indexing
Output 1 may be:
15#02#18# (meaning 1 occurs 5 times, 0 occurs 2 times, and 1 occurs 8 times:
111110011111111)
Output 2 may be:
11#01#11#
Output 3 may be:
112#012#
Steps of the Kimball approach for the data life cycle.
Kimball process: a four-step approach (Business process --> Grain --> Dimensions --> Facts).
He defines a business process as a major operational process in the organization that is
supported by some kind of legacy system (or systems). (Read "Business Development
Lifecycle", see page 290)
Drawback of traditional web search. Ch: 39 page 351
1. Limited to keyword based matching.
2. Cannot distinguish between the contexts in which a link is used.
3. Coupling of files has to be done manually.
Two ways of describing a session on the World Wide Web.
Identifying the Session
Web-centric data warehouse applications require every visitor session (visit) to have its own
unique identity
The basic protocol for the World Wide Web, HTTP, is stateless, so session identity must be
established in some other way.
MCQs
Execution will be terminated abnormally.... (Quiz 4 file- 2 MCQs)
Kimballs approach ......driven (quiz 4 file-5 mcqs)
Pipeline per increase through..... (Quiz 4 file- 1 mcq)
Selectivity of query in olap... (Queries must be executed in a small number of seconds.)
star schema simplify ...
Majority of data ...fail if (Majority of projects fail due to the complexity of the
development process.)
ER is .......design (constituted to optimize OLTP performance)
Survival of fittest is.....algorithm (Genetic Algorithms: these are based on the principle of
survival of the fittest. In these techniques, a model is formed to solve problems having
multiple options and many values. Briefly, these techniques are used to select the optimal
solution out of a number of possible solutions. However, they are not very robust, as they cannot
perform well in the presence of noise.)
Shipyard in Kobe developed....... (In 1972 the Mitsubishi Shipyards in Kobe developed a
technique in which customer wants were linked to product specifications via a matrix
format. The technique is known today as The House of Quality and is one of many
techniques of Quality Function Deployment, which can briefly be defined as a system
for translating customer requirements into appropriate company requirements. The
purpose of the technique is to reduce two types of risk. First, the risk that the product
specification does not comply with the wants of the predetermined target group of
customers. Secondly, the risk that the final product does not comply with the product
specification.)
Q: 1 Briefly explain any two types of precedence constraints that we can use in DTS.
Answer: page 395
Precedence constraints sequentially link tasks in a package. In DTS, you can use
three types of precedence constraints, which can be accessed either through DTS
Designer or programmatically:
Unconditional: If you want Task 2 to wait until Task 1 completes, regardless of the
outcome, link Task 1 to Task 2 with an unconditional precedence constraint.
On Success: If you want Task 2 to wait until Task 1 has successfully completed, link
Task 1 to Task 2 with an On Success precedence constraint.
On Failure: If you want Task 2 to begin execution only if Task 1 fails to execute
successfully, link Task 1 to Task 2 with an On Failure precedence constraint. If you want
to run an alternative branch of the workflow when an error is encountered, use this
constraint.
Q: 2 The time complexity of the K-means algorithm is O(tkn). What do t, k and n represent
here?
Page 281
Answer: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations.
Normally, k, t << n.
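The O(tkn) shape is visible in a toy one-dimensional k-means sketch (the naive initialization, the data, and the function name are all illustrative, not from the handout): t iterations, each scanning n points against k centroids.

```python
def kmeans_1d(points, k, iters=10):
    """Toy 1-D k-means: t iterations, each comparing n points with k centroids."""
    centroids = points[:k]                      # naive initialization
    for _ in range(iters):                      # t iterations
        clusters = [[] for _ in range(k)]
        for p in points:                        # n objects ...
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))  # ... times k distances
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans_1d([1, 2, 3, 10, 11, 12], 2))  # converges to the two cluster centres
```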
Q: 3 What are the problems you will face if low priority is given to cube construction?
Answer: page 313
Low priority for OLAP Cube Construction: Make sure your OLAP cube-building or pre-calculation process is optimized and given the right priority. It is common for the data
warehouse to be at the bottom of the nightly batch loads, and after loading the DWH,
usually there isn't much time left for the OLAP cube to be refreshed. As a result, it is
worthwhile to experiment with the OLAP cube generation paths to ensure optimal
performance.
Q: 4 List down any two parallel software architectures.
Answer: Shared Memory, Shared Disk and Shared Nothing
Q: 5 What is unsupervised learning in Data Mining?
Answer: page 27
Unsupervised learning is where you don't know the number of clusters and obviously have no
idea about their attributes either. In other words, you are not guiding the DM
process in any way: no guidance and no input. Unsupervised learning is
closer to the exploratory spirit of Data Mining, as is also stressed in the definitions given
above. In unsupervised learning situations all variables are treated in the same way; there
is no distinction between explanatory and dependent variables. However, in contrast to
the name undirected data mining, there is still some target to achieve. This target might be
as general as data reduction or more specific like clustering. For unsupervised learning,
typically either the target variable is unknown or has only been recorded for too small a number
of cases.
Q: 6 Which scripting languages are used to perform complex transformations in DTS
packages?
Answer: Microsoft SQL Server provides graphical tools to build DTS packages. These
tools provide good support for transformations. Complex transformations are achieved
through VB Script or Java Script that is loaded in the DTS package. A package can also be
programmed using the DTS object model instead of the graphical tools, but DTS
programming is rather complicated.
Q: 7 "Dense index consists of a number of bit vectors" — justify it.
Answer: Dense Index: every key in the data file is represented in the index file. A bitmap
index record is (Value, Bit Vector): the Bit Vector has one bit for every record in the file, and the ith bit
of the Bit Vector is set if the ith record has Value in the given column. Bit vectors are typically
compressed and converted to sets of RIDs during query evaluation.
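A minimal Python sketch of that structure (the function name and sample column are mine): one bit vector per distinct value, with the ith bit set when row i holds that value.

```python
def build_bitmap_index(column):
    """One bit vector per distinct value; the ith bit is set
    if row i holds that value in the indexed column."""
    index = {}
    for i, value in enumerate(column):
        index.setdefault(value, [0] * len(column))[i] = 1
    return index

gender = ["M", "F", "M", "F"]
print(build_bitmap_index(gender))  # {'M': [1, 0, 1, 0], 'F': [0, 1, 0, 1]}
```

Each bit vector here is exactly the kind of 0/1 run that the run-length encoding questions in these notes compress.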
Q: 8 It is essential to have a subject-matter expert as part of the data modeling team. What will
be the implication if such an expert is not present in the organization?
Answer: It is essential to have a subject-matter expert as part of the data modeling team.
This person can be an outside consultant or can be someone in-house with extensive
industry experience. Without this person, it becomes difficult to get a definitive answer
on many of the questions, and the entire project gets dragged out, as the end users may
not always be available.
Suppose there is a large enterprise which uses the same server for the development and
production environments. What problems can arise if it uses single server for both
purposes? 5m
To save capital, often data warehousing teams will decide to use only a single database
and a single server for the different environments i.e. development and production.
Environment separation is achieved by either a directory structure or setting up distinct
instances of the database.
This is awkward for the following reasons:
Sometimes it is possible that the server needs to be rebooted for the development
environment. Having a separate development environment will prevent the production
environment from being affected by this.
There may be interference when having different database environments on a single
server. For example, having multiple long queries running on the development
environment could affect the performance of the production environment, as both share the same server.
Write down any two drawbacks if Date is stored in text format rather than using a
proper date format like dd-MMM-yy etc. 5m
In context of Web data warehousing, consider the web page dimension, list at least five
possible attributes of this dimension. 5m
Page key
Page source
Page function
Page template
Item type
Graphic type
Animation type
Sound type
Page file name
There are different data mining techniques e.g. clustering, description etc. Each of
the following statement corresponds to some data mining technique. For each statement
name the technique the statement corresponds to. 5m
a) Assigning customers to predefined customer segments (i.e. good vs.
bad) classification
b) Assigning credit applicants to predefined classes (i.e. low, medium, or high risk)
classification
c) Guessing how much customers will spend during next 6 months prediction
d) Building a model and assigning a value from 0 to 1 to each member of the set. Then
classifying the members into categories based on a threshold value. Estimation
e) Guessing how many students will score more than 65% grades in midterm. Prediction
Specify at least one implication if you don't provide proper documentation as part of
data warehouse development. 3m
Usually by this time most, if not all, of the developers will have left the project, so it is
essential that proper documentation is left for those who are handling production
maintenance. There is nothing more frustrating than staring at something another person
did, yet unable to figure it out due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is another
phase of the data warehouse planned, start on that as soon as possible.
In context of nested loop join, mention two guidelines for selecting a table as inner table.
3m
For a Nested-Loop join inner and outer tables are determined as follows: page 242
The outer table is usually the one that has:
The smallest number of qualifying rows, and/or
The largest number of I/Os required to locate the rows.
The inner table usually has:
The largest number of qualifying rows, and/or
The smallest number of reads required to locate rows
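The naive nested-loop join these guidelines apply to looks like this in Python (an illustrative sketch; the row format and sample tables are mine): the inner table is rescanned once per outer row, which is why the outer table should contribute the fewest qualifying rows.

```python
def nested_loop_join(outer, inner, key):
    """Naive nested-loop join: the inner table is scanned once per outer row."""
    result = []
    for o in outer:          # few qualifying rows here ...
        for i in inner:      # ... means fewer full scans of the inner table
            if o[key] == i[key]:
                result.append({**o, **i})  # merge matching rows
    return result

depts = [{"dept": 1, "name": "Sales"}]
emps = [{"dept": 1, "emp": "Ali"}, {"dept": 2, "emp": "Sara"}]
print(nested_loop_join(depts, emps, "dept"))
```

With depts as the outer table the inner table is scanned once; swapping the tables would scan depts twice, illustrating the cost asymmetry behind the guidelines.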
We can identify the session on the World Wide Web by using Time-contiguous Log Entries;
however, there are some limitations of this technique. Briefly explain any two limitations.
3m
Answer: In many cases, the individual hits
comprising a session can be consolidated by collating time-contiguous log entries from
the same host (Internet Protocol, or IP, address). If the log contains a number of entries
with the same host ID in a short period of time (for example, one hour), one can
reasonably assume that the entries are for the same session.
Limitations: This method breaks down for visitors from large ISPs because different
visitors may reuse dynamically assigned IP addresses over a brief time period.
Different IP addresses may be used within the same session for the same visitor.
This approach also presents problems when dealing with browsers that are behind some
firewalls.
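The time-contiguous heuristic can be sketched in Python (the one-hour gap threshold follows the text; the function name and log format are mine): a new session starts whenever an IP has been silent longer than the gap.

```python
def sessionize(hits, gap=3600):
    """Group log hits (ip, timestamp) into sessions: a new session starts
    when the same IP has been silent for longer than `gap` seconds."""
    last_seen, current, next_id = {}, {}, 0
    labelled = []
    for ip, ts in hits:                    # hits assumed sorted by timestamp
        if ip not in last_seen or ts - last_seen[ip] > gap:
            next_id += 1
            current[ip] = next_id          # start a new session for this IP
        last_seen[ip] = ts
        labelled.append((ip, ts, current[ip]))
    return labelled

log = [("1.1.1.1", 0), ("1.1.1.1", 100), ("2.2.2.2", 200), ("1.1.1.1", 5000)]
print(sessionize(log))
```

The limitations above are visible in this sketch: two different visitors behind the same reused or firewalled IP would be merged into one session, and one visitor whose IP changes mid-visit would be split into two.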
Identify the given statement as correct or incorrect and justify your answer in either case.
"The problem of Referential Integrity always occurs in traditional OLTP system as well
as in DWH". 3m
Answer: The statement is incorrect. While doing total quality measurement, you measure RI every week (or month)
and hopefully the number of orphan records will go down, as you will be fine-tuning
the processes to get rid of the RI problems. Remember, the RI problem is peculiar to a
DWH; it will not happen in a traditional OLTP system.
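Measuring RI as described amounts to counting orphan records, i.e. fact rows whose foreign key has no matching dimension row. A SQLite sketch (the sales/customer table names and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE sales (sale_id INTEGER, cust_id INTEGER)")
con.executemany("INSERT INTO customer VALUES (?)", [(1,), (2,)])
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(10, 1), (11, 2), (12, 99)])   # cust_id 99 has no parent: an orphan

# Count fact rows whose foreign key finds no matching dimension row
orphans = con.execute("""
    SELECT COUNT(*) FROM sales s
    LEFT JOIN customer c ON s.cust_id = c.cust_id
    WHERE c.cust_id IS NULL
""").fetchone()[0]
print(orphans)  # 1
```

Running this count each week gives exactly the trend the answer describes: the orphan count should fall as the load processes are tuned.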
There are two primary techniques for gathering requirements i.e. interviews or facilitated
sessions. Which technique is preferred by Ralph Kimball? 2m
Both have their advantages and disadvantages. Interviews encourage lots of individual
participation. They are also easier to schedule. Facilitated sessions may reduce the
elapsed time to gather requirements, although they require more time commitment from
each participant. Kimball prefers using a hybrid approach with interviews to gather the
gory details and then facilitation to bring the group to consensus.
List down any two Parallel Software Architectures? 2m
Brief Intro to Parallel Processing:
Parallel Hardware Architectures
Shared Memory
Shared Disk
Shared Nothing
Types of parallelism
Data Parallelism
Spatial Parallelism
List down any four Static Attributes recorded by the scouts in Agriculture Data
Warehouse Case Study. 2m
Static attributes        Dynamic attributes
Farmer name              Date of visit
Variety sown             CLCV
Sowing date              Predator population
Answer: Kimball also proposes a four-step approach where he starts by choosing a business
process, takes the grain of the process, and chooses the dimensions and facts. He defines a
business process as a major operational process in the organization that is supported by
some kind of legacy system (or systems).
4. There are four categories of data quality improvement. Write any two. (2 marks)
Ans. The four categories of Data Quality Improvement
Process
System
Policy & Procedure
Data Design
1. Data profiling is a process which involves gathering of information. What are the purposes that it
must fulfill? (3 marks)
Domain of a column
We run different SQL queries to get the answers to the above questions. During this
process we can identify the erroneous records. Whenever we come across an
erroneous record, we just copy it into an error or exception table and set the dirty bit of the
record in the actual student table. Then we correct the exception table. After this
profiling process we transform the records and load them into a new table,
Student_Info
Ref: Handout Page No. 354
7. Apply Run length encoding on the given code and write output. (3 marks)
Case-I:
1111111110000111
Answer: 19#04#13
Case-II:
00001111000000
Answer: 04#14#06
8. Identify the given statement as correct or incorrect and justify your answer in either
case. (3 marks)
"One-way clustering is used to get local view and Two-way clustering is used to get
global view."
Answer: Incorrect
One-way clustering gives global view and bi-clustering gives local view
9. A pilot project strategy is highly recommended in data warehouse. What are the
reasons for its recommendation? (5 marks)
Answer: A pilot project strategy is highly recommended in data warehouse construction,
as a full blown data warehouse construction requires significant capital investment, effort
and resources. Therefore, the same must be attempted only after a thorough analysis, and
a valid proof of concept. A small scale project in this regard serves many purposes such
as (i) Show users the value of DSS information, (ii) establish blue print processes for later
full-blown project, (iii) identify problem areas and, (iv) reveal true data demographics.
Hence doing a pilot project on a small scale seemed to be the best strategy.
10. Data acquisition and cleansing. (5 marks)
The pest scouting sheets are larger than A4 size (8.5 x 11), hence the right end was
cropped when scanned on a flat-bed A4 size scanner.
The right part of the scouting sheet is also the most troublesome, because of pesticide
names for a single record typed on multiple lines i.e. for multiple farmers.
As a first step, OCR (Optical Character Reader) based image to text transformation of
the pest scouting sheets was attempted. But it did not work even for relatively clean
sheets with very high scanning resolutions.
Subsequently DEOs (Data Entry Operators) were employed to digitize the scouting
sheets by typing.
Data cleansing and standardization is probably the largest part in an ETL exercise. For
Agri-DWH major issues of data cleansing had arisen due to data processing and handling
at four levels by different groups of people i.e.
(i)
(ii)
(iii)
(iv)
12. A table was given containing Name, Item, Time and Gender, along with the following
statement. (5 marks)
IF
Items/Time >= 6
Then
Gender= F
else
Gender = M
a) Find the accuracy % of given data.
Subjective:
1.
Architecture?
Answer
Shared nothing RDBMS architecture requires a static partitioning of each table in the
database.
How do you perform the partitioning?
Hash partitioning
List partitioning.
Round-Robin
2.
3.
Answer:
Nested-Loop Join: Variants
1. Naive nested-loop join
2. Index nested-loop join
Answer:
Answer= Limitations
It's possible that the visitor will have his or her browser set to refuse cookies or may clean out
his or her cookie file manually, so there is no absolute guarantee that even a persistent cookie
will survive.
Although any given cookie can be read only by the Web site that caused it to be created,
certain groups of Web sites can agree to store a common ID tag that would let these sites
combine their separate notions of a visitor session into a super session
7. As the number of processes increases, the speedup should also increase. Thus, theoretically
there should be a linear speedup; however, this is not the case in reality. List at least 2 barriers to
linear speedup.
Answer:
Amdahl's Law
Startup
Interference
Skew
8.
In the context of nested loop join, mention two guidelines for the outer table. (answer in current
solution file)
9.
Before sitting down with the business community to gather information, it is suggested
to prepare so as to set yourself up for a productive session. Write three activities of the requirements preplanning phase.
Answer:
Requirements preplanning: This phase consists of activities like choosing the
forum, identifying and preparing the requirements team and finally selecting, scheduling
and preparing the business representatives.
Do you think it will create the problem of non-standardized attributes, if one source uses
0/1 and second source uses 1/0 to store male/female attribute respectively? Give a reason to
support your answer. 2 marks
There are two primary techniques for gathering requirements i.e. interviews or facilitated
sessions. Kimball prefers using which one? 2 marks.
Kimball prefers using a hybrid approach with interviews to gather the gory details and
then facilitation to bring the group to consensus.
Why is the Analytic Track considered the fun part? 2 marks.
Write any three complete warehouse deliverables. 3 marks
This query was given: SELECT * FROM R WHERE A = 5, and we have to tell which
technique is appropriate from dense, sparse, B-tree and hash indexing. 5 marks
Why should companies entertain students to visit their premises? 5 marks
Exception table
EmpID    EmpName    IsAgeValid    Age
1        Ali        1             28
2        Faisal     1             32
3        Waseem     1             389
4        Arham      1             398
We have to write a query to access employee table and set the value of IsAgeValid =0 where
age is greater than and equal to 25 and less than and equal to 75. 5 marks
Pest scouting (2)
Correct statement (2)
There was a table for which an index table had to be built (5)
There was a table for which the SQL query had to be given (5)
Hash partitioning
Key range partitioning.
List partitioning.
Round-Robin
Subject Oriented
Integrated
Nonvolatile
Time Variant
There was also one question on the output of run-length encoding (2 or 3 marks).
MCQs are mostly from past papers, almost 35 out of 40.
cs614FinaltermSolvedMCQsWithReferencesbyMoaaz.pdf
CS614FinaltermSolvedMCQsWithReferencesUpdate.pdf
Also go through all the quizzes.
Hash partitioning
Which common measurements can be used to measure the success of a specific email, advertisement or marketing campaign?
The success of an email or other marketing campaign can be measured by integrating with other operational systems. Common
measurements are: number of visitors, number of sessions, most requested pages, robot activity, etc.
Being a part of the training team, specify three guidelines that you consider part of an effective user education program.
Some options are:
Invest in just-in-time training (provided by data warehousing tool vendors)
Use pilot projects as seeds for new technology training
Develop reward systems that encourage experimentation
Use outside system integrators and individual consultants