
Univariate Analysis

A variable in univariate analysis is simply a condition or subset that your data falls into. You can think of it as a
“category.” For example, the analysis might look at a variable of “age,” “height,” or “weight.” However, it does
not look at more than one variable at a time; otherwise it becomes bivariate analysis (or, in the case of three or
more variables, multivariate analysis).
The following frequency distribution table shows one variable (left column) and the count in the right
column.

A frequency chart.
You could have more than one variable in the above chart. For example, you could add the variable
“Location” or “Age” or something else, and make a separate column for location or age. In that case you
would have bivariate data because you would then have two variables.

1. Univariate data –
This type of data consists of only one variable. The analysis of univariate data is thus the simplest form of
analysis, since the information deals with only one quantity that changes. It does not deal with causes or
relationships; the main purpose of the analysis is to describe the data and find the patterns that exist within it.
An example of univariate data is height.

Suppose that the heights of seven students in a class are recorded. There is only one variable, height, and it
does not deal with any cause or relationship. The patterns found in this type of data can be described by
drawing conclusions using measures of central tendency (mean, median and mode), measures of dispersion or
spread of data (range, minimum, maximum, quartiles, variance and standard deviation), and by using
frequency distribution tables, histograms, pie charts, frequency polygons and bar charts.
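The measures listed above can be sketched with Python's standard statistics module; the seven heights below are hypothetical illustration values, not data from the text:

```python
import statistics

# Hypothetical heights (in cm) of seven students -- illustrative only
heights = [142, 145, 150, 150, 155, 160, 162]

mean = statistics.mean(heights)            # central tendency
median = statistics.median(heights)
mode = statistics.mode(heights)
value_range = max(heights) - min(heights)  # dispersion: max minus min
variance = statistics.variance(heights)    # sample variance
std_dev = statistics.stdev(heights)        # sample standard deviation
```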
There are several options for describing univariate data with graphs and charts:

 Frequency Distribution Tables.
 Bar Charts.
 Histograms.
 Pie Charts.

Frequency Distribution
Frequency distribution is a table that displays the frequency of various outcomes in a sample. Each entry in
the table contains the frequency or count of the occurrences of values within a particular group or interval,
and in this way, the table summarizes the distribution of values in the sample.
Example
Problem Statement:
Construct a frequency distribution table for a survey taken on Maple Avenue. In each of 20 homes,
people were asked how many cars were registered to their household. The results were recorded as follows:

1 2 1 0 3 4 0 1 1 1 2 2 3 2 3 2 1 4 0 0

Solution:
Follow these steps to present the data in a frequency distribution table.
1. Divide the results (x) into intervals, and then count the number of results in each interval. In this case,
the intervals would be the number of households with no car (0), one car (1), two cars (2) and so
forth.
2. Make a table with separate columns for the interval numbers (the number of cars per household), the
tallied results, and the frequency of results in each interval. Label these columns Number of cars,
Tally and Frequency.
3. Read the list of data from left to right and place a tally mark in the appropriate row. For example, the
first result is a 1, so place a tally mark in the row beside where 1 appears in the interval column
(Number of cars). The next result is a 2, so place a tally mark in the row beside the 2, and so on.
When you reach your fifth tally mark, draw a tally line through the preceding four marks to make
your final frequency calculations easier to read.
4. Add up the number of tally marks in each row and record them in the final column entitled Frequency.
Your frequency distribution table for this exercise should look like this:

Frequency table for the number of cars registered in each household

Number of cars (x)   Tally    Frequency (f)
0                    llll     4
1                    llll l   6
2                    llll     5
3                    lll      3
4                    ll       2

By looking at this frequency distribution table quickly, we can see that out of 20 households surveyed, 4
households had no car, 6 households had 1 car, and so on.
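The same table can be reproduced in a few lines of Python, using the survey data from the example:

```python
from collections import Counter

# Cars registered per household, from the Maple Avenue survey above
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

freq = Counter(cars)          # counts occurrences of each value
for value in sorted(freq):
    print(f"{value}: {freq[value]}")
```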
Bar Chart
A bar chart is a graphical representation of categories as bars. Each bar's height is proportional to the
quantity of the category that it represents. A bar chart can be plotted vertically or horizontally. Usually it is
drawn vertically, where the x-axis represents the categories and the y-axis represents their values.

Problem Statement:

Create a bar chart that illustrates the various sports played by the boys and girls in a society.

Sr. No.   Sport         Boys   Girls
1         Basket Ball   15     10
2         Volley Ball   15     10
3         Badminton     10     15
4         Cricket       10     20
5         Football      20     15

Solution:

Use the x-axis for the sports categories and the y-axis for the number of boys and girls playing each sport.
Draw the chart as shown below:
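A plotting library would normally draw this chart; as a rough sketch of the idea, the snippet below renders the table's values as horizontal text bars (the layout is purely illustrative):

```python
# Sports data from the table above: sport -> (boys, girls)
sports = {
    "Basket Ball": (15, 10),
    "Volley Ball": (15, 10),
    "Badminton": (10, 15),
    "Cricket": (10, 20),
    "Football": (20, 15),
}

def bar(count, symbol):
    """Render one bar as a run of symbols, one symbol per unit."""
    return symbol * count

for sport, (boys, girls) in sports.items():
    print(f"{sport:12} boys  {bar(boys, '#')} ({boys})")
    print(f"{'':12} girls {bar(girls, '*')} ({girls})")
```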

Histogram

A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the
probability distribution of a continuous (quantitative) variable.
Problem Statement:
Every month, the owner measures how much weight the dog has gained, with these results:

0.5 0.5 0.3 -0.2 1.6 0 0.1 0.1 0.6 0.4

Draw a histogram showing how much the dog is growing.


Solution:
The monthly changes vary from -0.2 (the dog lost weight that month) to 1.6. Putting them in order from lowest
to highest weight gain:

-0.2 0 0.1 0.1 0.3 0.4 0.5 0.5 0.6 1.6

We decide to put the results into groups of 0.5:

 The -0.5 to just below 0 range.
 The 0 to just below 0.5 range, and so on.
And here is the result:

There are no values from 1 to just below 1.5, but we still show the space.
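The grouping step can be sketched in Python; the bin edges below follow the 0.5-wide ranges chosen above:

```python
import math

# Monthly weight changes from the example above
gains = [0.5, 0.5, 0.3, -0.2, 1.6, 0, 0.1, 0.1, 0.6, 0.4]

start, width, n_bins = -0.5, 0.5, 5  # bins: [-0.5, 0), [0, 0.5), ... [1.5, 2)
counts = [0] * n_bins
for g in gains:
    counts[math.floor((g - start) / width)] += 1  # which bin g falls into

print(counts)  # one count per bin; the [1, 1.5) bin stays empty
```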
Pie Chart
A pie chart (or pie graph) is a circular statistical chart, which is divided into slices in order to
explain or illustrate numerical proportions. In a pie chart, the central angle, area and arc length of each slice
are proportional to the quantity or percentage it represents. The percentages should total 100 and the
arc measures should total 360°. The following pie graph depicts the cost of construction of a house.

From this graph, one can compare the sum spent on cement, steel and so on. One can also compute the actual
sum spent on each individual expense. Consider an example, where we want to know how much more is the
labour cost when compared to cost of steel.
Amount spent on labour = (90/360) × 600,000 = $150,000
Amount spent on steel = (54/360) × 600,000 = $90,000
Excess = 150,000 − 90,000 = $60,000
Let 60,000 = x% of 600,000 ⟹ (x/100) × 600,000 = 60,000 ⟹ x = 10% of the total expense.

2. Bivariate data –
This type of data involves two different variables. The analysis of this type of data deals with causes and
relationships, and is done to find the relationship between the two variables. An example of
bivariate data is temperature and ice cream sales in the summer season.
 Suppose temperature and ice cream sales are the two variables of a bivariate data set. Here, the
relationship is visible from the table: temperature and sales are directly proportional to each other,
and thus related, because as the temperature increases, the sales also increase. Thus bivariate
analysis involves comparisons, relationships, causes and explanations. These variables are often
plotted on the X and Y axes of a graph for a better understanding of the data, and one of the variables is
independent while the other is dependent.
Scatterplot
A scatterplot is a graphical way to display the relationship between two quantitative sample
variables. It consists of an X axis, a Y axis and a series of dots where each dot represents one
observation from a data set. The position of the dot refers to its X and Y values.

Patterns of Data in Scatterplots

Scatterplots are used to analyze patterns, which generally vary in terms of linearity, slope, and
strength.
 Linearity - the data pattern is either linear/straight or nonlinear/curved.
 Slope - the direction of change in variable Y with respect to an increase in variable X. If Y increases
as X increases, the slope is positive; otherwise the slope is negative.
 Strength - the degree of scatter in the plot. If the dots are widely dispersed, the relationship is
considered weak. If the dots are dense around a line, the relationship is said to be strong.

(a) Correlation

Correlation : - Correlation is a statistical technique that can show whether and how strongly pairs of
variables are related. For example, height and weight are related; taller people tend to be heavier than
shorter people. The relationship isn't perfect. People of the same height vary in weight, and you can
easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the
average weight of people 5'5'' is less than the average weight of people 5'6'', and their average weight is
less than that of people 5'7'', etc. Correlation can tell you just how much of the variation in people's
weights is related to their heights.
Correlation is discussed below in terms of rating scales and the correlation coefficient.
Correlation works for quantifiable data in which numbers are meaningful, usually quantities of some
sort. It cannot be used for purely categorical data, such as gender, brands purchased, or favorite color.

Rating Scales

Rating scales are a controversial middle case. The numbers in rating scales have meaning, but that meaning
isn't very precise. They are not like quantities. With a quantity (such as dollars), the difference between 1 and
2 is exactly the same as between 2 and 3. With a rating scale, that isn't really the case. You can be sure that
your respondents think a rating of 2 is between a rating of 1 and a rating of 3, but you cannot be sure they
think it is exactly halfway between. This is especially true if you labeled the mid-points of your scale (you
cannot assume "good" is exactly halfway between "excellent" and "fair").

Most statisticians say you cannot use correlations with rating scales, because the mathematics of the
technique assume the differences between numbers are exactly equal. Nevertheless, many survey researchers
do use correlations with rating scales, because the results usually reflect the real world. Our own position is
that you can use correlations with rating scales, but you should do so with care. When working with
quantities, correlations provide precise measurements. When working with rating scales, correlations provide
general indications.

Correlation Coefficient

The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to +1.0. The
closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is no relationship between the variables. If r is positive, it means that as one
variable gets larger the other gets larger. If r is negative it means that as one gets larger, the other gets
smaller (often called an "inverse" correlation).

While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes
them easier to understand. The square of the coefficient (r squared) is equal to the percentage of the variation in
one variable that is related to the variation in the other (after squaring r, drop the decimal point to read it as a
percentage). An r of .5 means 25% of the variation is related (.5 squared = .25). An r value of .7 means 49% of
the variance is related (.7 squared = .49).
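Computing r from its definition can be sketched as below; the hours/scores pairs are hypothetical illustration data, not from the text:

```python
import math

# Hypothetical paired data: study hours vs. test scores
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 66, 71]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Pearson's r = co-variation of x and y / (spread of x * spread of y)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sx = math.sqrt(sum((x - mean_x) ** 2 for x in hours))
sy = math.sqrt(sum((y - mean_y) ** 2 for y in scores))

r = cov / (sx * sy)
r_squared = r ** 2  # share of variation in one variable related to the other
```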

A correlation report can also show a second result of each test - statistical significance. In this case, the
significance level will tell you how likely it is that the correlations reported may be due to chance in the form
of random sampling error. If you are working with small sample sizes, choose a report format that includes
the significance level. This format also reports the sample size.

A key thing to remember when working with correlations is never to assume a correlation means that a
change in one variable causes a change in another. Sales of personal computers and athletic shoes have both
risen strongly over the years and there is a high correlation between them, but you cannot assume that buying
computers causes people to buy athletic shoes (or vice versa).

(b) Regression

In statistics, it’s hard to stare at a set of random numbers in a table and try to make any sense of it. For
example, global warming may be reducing average snowfall in your town and you are asked to predict how
much snow you think will fall this year. Looking at the following table you might guess somewhere around
10-20 inches. That’s a good guess, but you could make a better guess, by using regression.

Essentially, regression is the “best guess” at using a set of data to make some kind of prediction. It’s fitting a
set of points to a graph. There’s a whole host of tools that can run regression for you, including Excel, which
I used here to help make sense of that snowfall data:

Just by looking at the regression line running down through the data, you can fine tune your best guess a bit.
You can see that the original guess (20 inches or so) was way off. For 2015, it looks like the line will be
somewhere between 5 and 10 inches! That might be “good enough”, but regression also gives you a useful
equation, which for this chart is:
y = -2.2923x + 4624.4.
What that means is you can plug in an x value (the year) and get a pretty good estimate of snowfall for any
year. For example, 2005:
y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is pretty close to the actual figure of 30 inches for that
year.
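A least-squares line like the one above can be fitted by hand; the snowfall figures below are hypothetical stand-ins, since the original table is not reproduced here:

```python
# Hypothetical snowfall data (year, inches) -- stand-ins for the original table
years = [2000, 2001, 2002, 2003, 2004]
snowfall = [40, 38, 35, 33, 30]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(snowfall) / n

# Least-squares slope and intercept for y = slope * x + intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, snowfall))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x

predicted_2005 = slope * 2005 + intercept  # plug in a year to predict snowfall
```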
Correlation and regression are the two analyses based on a multivariate distribution, which is
described as a distribution of multiple variables. Correlation is the analysis that tells us whether a
relationship between two variables ‘x’ and ‘y’ is present or absent.
Regression analysis, on the other hand, predicts the value of the dependent variable based on the known
value of the independent variable, assuming an average mathematical relationship between two or
more variables.
The difference between correlation and regression is one of the most commonly asked questions in
interviews, and many people find the two concepts ambiguous. The following comparison should make
the distinction clear.
Comparison of Correlation and Regression

Meaning: Correlation is a statistical measure which determines the co-relationship or association of two
variables. Regression describes how an independent variable is numerically related to the dependent variable.

Usage: Correlation represents a linear relationship between two variables. Regression fits a best line and
estimates one variable on the basis of another.

Dependent and independent variables: In correlation there is no difference between the two variables. In
regression the two variables are different (one dependent, one independent).

(c) Chi-square

A chi-square (χ2) statistic is a test that measures how expectations compare to actual observed data
(or model results). The data used in calculating a chi-square statistic must be random, raw, mutually
exclusive, drawn from independent variables, and drawn from a large enough sample. For example,
the results of tossing a coin 100 times meet these criteria.

Chi-square tests are often used in hypothesis testing.

There are two main kinds of chi-square tests: the test of independence, which asks a
question of relationship, such as, "Is there a relationship between gender and SAT scores?";
and the goodness-of-fit test, which asks something like "If a coin is tossed 100 times, will
it come up heads 50 times and tails 50 times?"

For these tests, degrees of freedom are utilized to determine if a certain null hypothesis can
be rejected based on the total number of variables and samples within the experiment.

For example, when considering students and course choice, a sample size of 30 or 40
students is likely not large enough to generate significant data. Getting the same or similar
results from a study using a sample size of 400 or 500 students is more valid.
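The goodness-of-fit statistic from the coin example can be computed directly; the observed counts of 55 heads and 45 tails are hypothetical illustration values:

```python
# Goodness-of-fit: chi2 = sum((observed - expected)^2 / expected)
# Hypothetical coin results: 55 heads and 45 tails in 100 tosses
observed = [55, 45]
expected = [50, 50]  # a fair coin should give 50/50

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)  # 1.0
```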
