InfoStat

Statistical Software

User's Manual
Version 2012


InfoStat software and documentation are the result of the active and multidisciplinary
participation of all the members of Grupo InfoStat, who are Copyright owners. Principal
responsibilities and activities are as follows:

Programming: Julio A. Di Rienzo


Quality control: Fernando Casanoves, Laura A. Gonzalez, Mónica G. Balzarini
Editorial director of the User's Manual: Fernando Casanoves, Julio A. Di Rienzo
Electronic version of the User's Manual: Fernando Casanoves
Online help: Elena M. Tablada
Citation for this manual is as follows:
Casanoves F., Balzarini M.G., Di Rienzo J.A., Gonzalez L., Tablada M., Robledo C.W.
(2012). InfoStat User's Manual, Córdoba, Argentina
The software to which this manual refers should be cited as follows:
Di Rienzo J.A., Casanoves F., Balzarini M.G., Gonzalez L., Tablada M., Robledo C.W.
InfoStat versión 2012. InfoStat Group, Facultad de Ciencias Agropecuarias, Universidad
Nacional de Córdoba, Argentina. URL http://www.infostat.com.ar
Total or partial reproduction of this work, in identical or modified form, by any means,
mechanical or electronic, including photocopying, recording, or any information storage and
retrieval system, is prohibited without the authorization of the copyright owners.


Prologue
InfoStat is statistical software developed by Grupo InfoStat, a team of professionals in
Applied Statistics based at the Faculty of Agricultural Sciences of the National University of
Córdoba (Facultad de Ciencias Agropecuarias, Universidad Nacional de Córdoba). The following
professors of Statistics and Biometry participated in the development of InfoStat: Julio A. Di
Rienzo, Mónica G. Balzarini, Fernando Casanoves, Laura A. Gonzalez, Elena M.
Tablada, and Carlos W. Robledo. InfoStat is a synthesis of experience accumulated since
1982. It has been enriched by teaching at the undergraduate and graduate levels,
statistical consulting, and the training of human resources in Applied Statistics.
We are proud of InfoStat's level of acceptance within university environments, at research
and technological institutions, and among businesses devoted to the production of goods and
services.
This manual consists of four chapters: Data Management, Statistics, Graphs and
Applications. The chapter on Data Management contains information on how to operate the
program in order to use files, and it describes the activities that can be done with data tables.
The chapter on Statistics describes the methodological tools that the user can select in
analyzing his or her data. These descriptions are accompanied by examples of their
implementation using InfoStat, and they are based on numerous real situations in which the
application of one or more statistical techniques is beneficial. The chapter on Graphs also
uses examples to describe the different types of graphical representations available. The
chapter on Applications covers statistical methods used in statistical quality control and in
the quantification of biodiversity, as well as computational tools that facilitate the teaching
and learning of classical statistical concepts.
This manual reflects the state of development of InfoStat at the time of print; nevertheless,
InfoStat keeps growing, improving and upgrading algorithms and user interfaces. Through
InfoStat's Help Menu, users can access the electronic version of the manual and a link to
upgrade the manual.


Table of contents
Installation ____________________________________________________________ 1
Upgrading _____________________________________________________________ 1
Requirements __________________________________________________________ 1
General aspects_________________________________________________________ 2
Data Management ______________________________________________________ 5
File ________________________________________________________________________ 5
New table _________________________________________________________________ 5
Open table ________________________________________________________________ 5
Save table_________________________________________________________________ 8
Save table as ______________________________________________________________ 9
Close table ________________________________________________________________ 9
Edit ________________________________________________________________________ 9
Data ______________________________________________________________________ 11
New row ________________________________________________________________ 12
Insert row ________________________________________________________________ 12
Delete row _______________________________________________________________ 12
Deactivate case ___________________________________________________________ 12
Activate case _____________________________________________________________ 13
Invert selection ___________________________________________________________ 13
Choosing cases ___________________________________________________________ 13
New column______________________________________________________________ 15
Insert column _____________________________________________________________ 15
Delete column ____________________________________________________________ 15
Edit Labels _______________________________________________________________ 15
Read labels from ________________________________________________________ 16
Data type ________________________________________________________________ 16
Alignment _______________________________________________________________ 16
Decimals ________________________________________________________________ 16
Automatically adjust columns ________________________________________________ 16
Sort ____________________________________________________________________ 16
Categorize _______________________________________________________________ 18
Edit categories ____________________________________________________________ 20
Transforming _____________________________________________________________ 20
Create dummy variables ____________________________________________________ 23
Fill... ___________________________________________________________________ 23
Formula _________________________________________________________________ 29
Search __________________________________________________________________ 33
Resampling ______________________________________________________________ 33
Color selection ____________________________________________________________ 33
Merge tables _____________________________________________________________ 34
Rearrange columns, one under the other ________________________________________ 34
Rearrange rows as columns __________________________________________________ 34
Create a new table using active cases __________________________________________ 35
Merge categories __________________________________________________________ 35

Output ____________________________________________________________________ 35
Upload results ____________________________________________________________ 35
Save results ______________________________________________________________ 35
Decimals ________________________________________________________________ 35
Field separator ____________________________________________________________ 36
Typography ______________________________________________________________ 36
Export results to table ______________________________________________________ 36

Statistics _____________________________________________________________ 37
Descriptive statistics _________________________________________________________ 38
Summary statistics _________________________________________________________ 38
Frequency tables __________________________________________________________ 40
Probabilities and quantiles ___________________________________________________ 42
Estimators of population characteristics __________________________________________ 43
Definitions of terms associated with the sampling technique ________________________ 43
Simple random sample _____________________________________________________ 45
Stratified sample __________________________________________________________ 47
Stratified sampling _________________________________________________________ 49
Sample size calculation _______________________________________________________ 51
Estimating a mean with a given precision _______________________________________ 51
Inference in one and two populations ____________________________________________ 53
Inference based on one sample _______________________________________________ 53
Two-sample inference ______________________________________________________ 60
Analysis of variance __________________________________________________________ 71
Completely random design __________________________________________________ 74
Block design _____________________________________________________________ 77
Latin square design ________________________________________________________ 79
Multiple comparisons ___________________________________________ 94
ANOVA assumptions _____________________________________________________ 101
Analysis of covariance_____________________________________________________ 105
Non-parametric analysis of variance ____________________________________________ 107
Kruskal-Wallis test _______________________________________________________ 107
Friedman test ____________________________________________________________ 108
Validation of assumptions __________________________________________________ 118
Regression with dummy variables ____________________________________________ 123
Nonlinear regression analysis _______________________________________ 128
Correlation analysis _________________________________________________________ 132
Correlation between distance matrices ________________________________________ 135
Categorical data analysis _____________________________________________________ 136
Contingency tables _______________________________________________________ 136
Logistic regression ________________________________________________________ 146
Kaplan-Meier survival analysis ______________________________________________ 148

Multivariate Analysis __________________________________________________ 153


Multivariate descriptive statistics _______________________________________________ 154
Hierarchical clustering methods _____________________________________________ 163
Non-hierarchical clustering methods __________________________________________ 167
Distances _______________________________________________________________ 167
Principal components ________________________________________________________ 167
Canonical correlations _______________________________________________________ 180
Partial Least Squares Regression _______________________________________________ 184
Multivariate analysis of variance _______________________________________________ 188
Distances and association matrices _____________________________________________ 196
Principal coordinates analysis _________________________________________________ 205
Classification-regression trees _________________________________________________ 207
Biplot and MST ____________________________________________________________ 208
Generalized Procrustes analysis ________________________________________________ 210
Cross-correlations __________________________________________________________ 219
Box and Jenkins methodology (ARIMA) ________________________________________ 222
Fitting and smoothing _______________________________________________________ 235
Series Tab ______________________________________________________________ 239
Legends ________________________________________________________________ 245

Applications __________________________________________________ 267


Quality control _____________________________________________________________ 267
Control chart for attributes__________________________________________________ 269
Variable control charts_____________________________________________________ 274
Confidence intervals ______________________________________________________ 287
All possible samples ______________________________________________________ 289
Sampling from the empirical distribution ______________________________________ 291
Biodiversity indices ______________________________________ 295


Installation
To install InfoStat, go to our web page www.infostat.com.ar, download the installer and run
it. Once the installation is complete, the installer will have created a folder called InfoStat
in C:\Program Files\ and a shortcut icon on the desktop.
Inside the InfoStat folder, C:\Program Files\InfoStat, you should find the
following:
Data folder: contains all the data files to which this manual refers.
Help folder: contains the online help file.
Manual.pdf: contains the printed material that comes with the CD. The electronic
version may contain an updated version of the printed material.

Upgrading
Upgrading instructions can be accessed through the HELP menu. The UPGRADE option
opens the InfoStat web page, where the latest applications can be downloaded.

Requirements
Processor required: Pentium or higher.
Minimum suggested memory: 128 MB.
Operating systems: Windows XP or newer.
Monitor configuration: minimum resolution of 800 x 600 pixels, small fonts. Using large
fonts may prevent parts of the windows displayed by InfoStat from being shown correctly
during use.

Data Management

General aspects
InfoStat offers different tools so that the user can easily explore information. When InfoStat
is opened, a toolbar appears on the topmost window of the program; it contains the
following menus: File, Edit, Data, Results, Statistics, Graphics, Windows, Help, and
Applications.
Below the menus, the toolbar contains a series of buttons that allow the user to perform
actions quickly. All of the actions that can be performed with the buttons can also be
performed from one of the menus listed above.

By positioning the mouse over a button, but without clicking, the user can visualize a help
label over the button as well as a legend at the foot of the screen, indicating the type of
action that can be performed with that button. These actions are as follows (for buttons
ordered from left to right): New table, Open table, Save active table, Export table, Print,
New column, Sort, Categories, Font, Align left, Align center, and Align right.
At the foot of the screen, the user will visualize three minimized windows, one named
Results, another Graphs, and another Graphical Tools. If the Results window is
maximized as soon as the program is opened, InfoStat will report that there are no results
available. This window will receive content as actions (analyses) that generate results are
performed. The Graphs and Graphical Tools screens are only activated when a graphic is
generated.
In the FILE menu, InfoStat allows the user to open and save different types of data files. For
example, if New Table is activated, the following screen will appear:

By using the keyboard, the user can enter information in the table or file temporarily named
New. Using this table, the user can perform data analysis as well as produce results and
graphics. The Exit command, used to close the application, can also be found in the FILE
menu.
Commands for cutting, copying and pasting information from data, results and graphics
windows can be found in the EDIT menu. The DATA menu allows the user to conduct
different types of operations on a data grid. It is possible to order a file, transform columns,
generate new columns based on formulas, simulate random variables, and automatically find
and replace information, among other actions. From the OUTPUT menu the user can invoke
actions related to the presentation and exportation of results in table format.
All of the generated results (tables and graphs) can be copied using the EDIT menu (Copy)
and can then be pasted into a word processor. This is the simplest way to transfer results
from InfoStat to a document or written report. The use of the Copy and Paste commands is
also the simplest way to import and export data between InfoStat and a word processor or
spreadsheet program such as Excel. To simplify the transfer of spreadsheet data,
InfoStat provides the commands Copy with column name and Paste with column name,
which preserve the names and labels of columns. It is also possible to
import and export information in ASCII format. In this chapter, the options from the FILE,
EDIT, DATA and OUTPUT menus are described with examples.
InfoStat works with three types of windows: one where data are found (Data), one where
results and procedures are solicited (Results), and one where graphs created by the user are
shown and stored (Graphs). Several data windows can be kept open simultaneously. In
such cases, the active window is the one in the front, with a colored frame (not gray). All
actions will be executed on the active data window. The Results and Graphs windows
contain a sheet for each result and/or graph produced. The user can move across the
different sheets by clicking once on the labels found at the foot of the window, which
indexes the results.
In the STATISTICS menu, in an almost automatic manner (through the use of dialogue
windows), InfoStat makes it possible to implement an ample variety of statistical analyses.
The user can calculate descriptive statistics; calculate probabilities; estimate population
characteristics with different sampling plans; calculate inference statistics for one and two
samples by using different types of confidence intervals and hypothesis tests (parametric
and non-parametric); use regression models and analysis of variance for different types of
experimental designs and observational studies; use inference statistics for categorical data;
use multivariate statistics; do time series analysis; and perform fitting and smoothing.
After selecting the desired statistical procedure to be used in analyzing the data of an open
table (active table), a window (Variables) appears in which all the file's columns are listed
on the left-hand side, so that the user can select the column(s) to be included in the
analysis, either as the variable of interest or as classification criteria. The selected columns
should be moved to the list of Variables, which is found on the right-hand side of the
window, using the arrow button. If a variable was incorrectly selected
or is no longer needed, it can be removed from the list of variables and returned to
the list of columns in the file by pressing the opposite arrow button, after having selected the variable
or having double clicked on it.
The variable selector facilitates analysis, making it unnecessary to remember or write down
the names of the variables each time they are to be used.
In the GRAPHS menu, InfoStat provides professional-style graphical tools for the
presentation of results. Various graphical techniques are employed, and they are described
in the chapter entitled Graphs. The program allows the inclusion of several series in a
single graph and the interactive editing of all their attributes, by using the Graphical Tools window,
which opens automatically when a graph is requested. InfoStat has a mechanism for
copying and applying formats which facilitates the creation of graphical series with
identical characteristics. Graphs created by InfoStat can be saved, or copied and pasted into
any Windows application that supports images (enhanced metafile), by using the classic Copy
and Paste (or Paste Special) Windows commands. All the tools on the GRAPHS menu are
available in every version of InfoStat.
Through the WINDOWS menu, the user can move from one window to another. Another
way to access a window is to simply move the cursor to the desired window. The Windows
menu also allows the user to select the mode in which the open windows are presented on
the screen. The windows can be presented in cascade, vertically or horizontally by selecting
the appropriate option: Cascade, Align vertical, or Align horizontal. From this menu, the
user can also reach the Results window, where the results of a session that the user has not
deliberately erased are stored. Similarly, the user can move to the Graphs window. The
names of open data tables are also listed.
Through the HELP menu, the user can access online documentation regarding procedures
and types of statistical analysis which can be implemented from any of the enabled menus,
as well as access an electronic version of the InfoStat manual. Moreover, this menu can be
used to gain fast access to software updates.
In the APPLICATIONS menu, traditional analysis tools are available, and these can be used
to explore information in groups of data from specific areas of knowledge. The following
applications are available: QUALITY CONTROL, TEACHING TOOLS, INDICES and
DNA-MICROARRAY ANALYSIS. The TEACHING TOOLS application is oriented
toward providing classical elements for teaching and learning applied statistics. Some tools
frequently used in statistical quality control are found in the QUALITY CONTROL
application. Under the INDICES item, the user can calculate numerous biodiversity indices
commonly used in Ecology. In the DNA-MICROARRAY ANALYSIS application,
procedures for normalizing, transforming, filtering, grouping and ordering genes, ordering
microarrays, correcting p-values to control the false discovery rate (FDR), and testing
p-values are available, among others.
When an option in any of the menus shows up in gray instead of in black, this indicates that
the option is not enabled. This could be because the user has not completed a previous step
necessary for that action, or because the action is not available in the acquired version of
InfoStat.


Data Management
InfoStat processes information proceeding from a table. A table is defined as a group of data
organized in rows and columns. The columns usually represent the variables while the rows
usually represent the observations. Column labels are the names assigned to variables.

File
The actions (submenus) applied to the management of tables in the FILE menu are the
following:
NEW TABLE, OPEN TABLE, SAVE TABLE, SAVE TABLE AS..., and CLOSE TABLE. Also
available in this menu are an EXIT option and a list of the most recently modified files.

New table
FILE menu → NEW TABLE creates a new table. The user can also press <Ctrl+N>
or use the button with the blank sheet found on the toolbar (New Table button). A table with
one row and two columns will appear, and these can be expanded in order to enter data.
New tables are numbered consecutively (New table, New table_1, New table_2, etc.).

Open table

FILE menu → OPEN TABLE opens an existing table. The user can also press <Ctrl+O>
or use the button with the picture of a file (Open Table button) on the toolbar. By pressing
<Shift>+ Open Table button, the user can directly access the Data file which contains the
files used in the examples in this manual. In order to open a table, the user should provide
the information solicited in the dialogue window.
InfoStat allows users to open files with the following formats:

InfoStat (*.IDB, *.IDB2)


Text (*.TXT, *.DAT)
InfoGen (*.IGDB)

Excel (*.XLS)
Dbase (*.DBF)
Paradox (*.DB)

Graph (*.IGB)
Results (*.ITRES)
EpiInfo (*.REC)

InfoStat assumes that in the data structure, columns represent variables and rows represent
observations. For each variable, every value should correspond to the same data type
(integer, real, categorical or date).
If the user wishes to open an ASCII file with a TXT or DAT extension, the Import text
window will be activated.

By using the Import text window the user can indicate the Field separators to be
used (tab, comma, semicolon, space or others). The data to be imported may contain the
names of the variables (columns). If the data contain the names of the columns, the user can
indicate that what appears in Row 1 will become the names of the columns in the data table
(InfoStat shows this option by default). If the file has text before the row with the names of
the columns, the user should indicate which line contains the names of the columns. This can
be done by changing the number that appears next to the Row 1 option, until the line with
the names of the columns is shown in the first row. If the data do not contain the names of
columns, the option Use first row as column name should be deselected. In this case, the
variables will be headed as Column 1, Column 2, etc. In order to preview the information
that will make up the table once it is imported, press the Preview table button. If the
structure is correct, press Accept; otherwise, change the options and try again with Preview
table, until the desired result is obtained.

Note: When data tables that have been saved as text (with .TXT extension) from Microsoft Excel are
imported, the empty cells in the original file appear as two consecutive separators. In this case,
the option Consecutive separators are generated as one should not be selected. By default, InfoStat
shows this option as unselected when a text file is opened. If, however, the file contains numeric and
alphanumeric data in a single column, InfoStat only recognizes the type of the first value in the
column: if it is a number, the alphanumeric values will be erased, and vice versa. The simplest way to
read files from another program is by using the Copy and Paste functions. InfoStat provides the
options Copy with column name and Paste with column name to facilitate the importing and
exporting of data. For example, in order to import an Excel file, the user should simply copy, in
Excel, the data to be transferred to InfoStat, including the names of the columns. The user should
then open a new table in InfoStat and paste the copied content by using the option Paste with column
name.
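
For readers who prepare or check text files outside InfoStat before importing them, the same import choices (field separator, header row, consecutive separators) can be sketched in Python with pandas. This is an illustration only, not part of InfoStat, and the file name atriplex.txt is hypothetical:

import pandas as pd

# Hypothetical tab-separated file whose first row holds the column names
# (the analogue of leaving "Use first row as column name" selected).
df = pd.read_csv("atriplex.txt", sep="\t", header=0)

# File with two lines of descriptive text before the header row: skip them,
# so the line with the column names becomes the first row read.
df = pd.read_csv("atriplex.txt", sep="\t", header=0, skiprows=2)

# Space-separated file in which runs of separators should count as one
# (the analogue of "Consecutive separators are generated as one").
df = pd.read_csv("atriplex.txt", sep=r"\s+", header=0)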

Table toolbar
By positioning the cursor over a table and right clicking the mouse, several options become
available, including the Toolbar. This option allows the user to add a bar of buttons to an
active table, such as the one shown below.

These buttons allow the user to do the following, from left to right: increase font size,
reduce font size, remove decimals (the user should first click on a cell of the column of
interest), add decimals (the user should first click on a cell of the column of interest), insert
a row (before a previously selected row), delete a previously selected row, add a
column to the end of the table, insert a column (before a previously selected column),
delete a previously selected column, and highlight a selection.
The font size can also be modified by pressing Ctrl together with the corresponding
increase or decrease key.

Variable management
This window appears when an active table is open and the user presses <Ctrl+E>. The
following actions are available in the dialogue box:


Rename variables: This can be done by double clicking on a variable name in a list of
variables.
Move the position of one or more variables: The variables can be selected from the list,
and while pressing <Ctrl>, the selected block can be moved with the arrow buttons (one
moves it up, the other moves it down). Changes in the position of the list are automatically
updated in the table.
Select one or more variables to be eliminated: Once the variables are selected from the
list, click on the Mark to eliminate button. The variable will be eliminated from both the
list and the table.
Deactivate / activate one or more variables: When the check box to the left of the label is
unchecked, the variable is deactivated. (In the example, all the variables with a 1 in the
label are activated and selected.) The deactivated variables do not appear either in the table
or in the variable selector.
Forming groups of variables: Groups of variables can be formed by selecting the variables
and pressing the Group selection button. Variables in a group can be activated or deactivated, colored, erased, etc. all together.

Save table
FILE menu → SAVE TABLE saves the active table in InfoStat format (with .IDB2
extension), in the directory in use. The same can be achieved by pressing <Ctrl+S>, or the
Save active table button on the toolbar.


Save table as
FILE menu → SAVE TABLE AS... saves the active table in the format and
directory chosen by the user. The available formats are listed below:

InfoStat (*.IDB, *. IDB2)


ASCII (*.TXT)

Excel (*.XLS)
InfoGen (*.IGDB)

Dbase (*.DBF)
Paradox (*.DB)

The Export table button on the toolbar can also be used.

In the dialogue box, indicate the name, place and type of file. If an ASCII format is selected,
the user should select a field separator and indicate whether the first row should be used as
the name of columns (labels). If desired, the user can also indicate whether a character (or
group of characters) should identify a missing observation in the exported file.

Close table
FILE menu → CLOSE TABLE closes the active table. Alternatively, the user can press
<Ctrl+W>. If the table has been modified and has not been saved, InfoStat will ask the user
to confirm whether he wishes to save it.

Edit
The actions (sub-menus) that can be applied to the management of InfoStat tables in the
EDIT menu are the following: Cut, Copy, Paste, Copy with column name, Paste with
column name, Undo and Select all. The actions are used to edit cells, columns and/or rows,
similar to the editing of texts in Windows.


Data entered in an InfoStat table are modified from the active table. Pressing
<Enter> commits the typed characters to the table. Pressing <Esc>
before <Enter> restores the previous content of the cell.
To stop editing, use the arrow keys (up, down, left, right), the tab key, or select another cell
with the mouse.

To select a group of cells, use the mouse to select the desired area. Alternatively,
select cells by using the keyboard, keeping the <Shift> key pressed and using the arrow
buttons to select the desired area. The highlighted areas can be printed by pressing the
Print button, found on the toolbar.

It is possible to select the font type, style, size and color for the entire table. This can
be done by simply selecting a cell and pressing the button with the letter A on the toolbar
to obtain the appropriate menu for this action. Buttons for the alignment of data to the right,
left, and center of the column also exist. These are located next to the A button.
In tables with .IDB2 format, a description of data contained in the table can be saved. The
description can be edited by pressing F2. When F2 is pressed, a field for writing the
description appears. If the second button on the toolbar of the dialogue window is pressed,
this field will be inserted in the file. If the user wishes to definitively include the description
in the data file, he should save the table.

A description can be uploaded from a file with TXT or RTF format by pressing the first
button on the mentioned toolbar.

Data

The actions (submenus) applied to the management of InfoStat tables in the DATA menu
are the following: New row, Insert row, Delete row, Deactivate case, Activate case,
Invert selection, Select cases, New column, Insert column, Delete column, Edit
categories, Edit label, Read labels from, Data type, Alignment, Decimals, Variable
manager, Categorize, Fill, Generate a class-variable according to cell color, Adjust
column width, Sort, Transformation, Create dummy variables, Formula, Search,
Sampling-Resampling, Color selection, Merge tables, Rearrange columns, one under
the other, Rearrange rows as columns, Create new table using active cases, Make a
new column by merging categorical variables, Split a category in its components,
Update, Show-edit data table description.
These actions can also be invoked by right-clicking the mouse when positioned on the data
table.
The following example illustrates some of the actions executed by the submenus.
Example 1: The user has access to a group of observations that refer to seed size (Size),
color of episperm (Episperm), germination percentage (PG), number of normal seedlings
(NP) and dry weight (DW) of seeds of Atriplex cordobensis, a forage shrub. The data are
located in the file Atriplex.idb (courtesy of Dr. M.T. Aiazzi, of the Faculty of Agricultural
Sciences, U.N.C.).
Note: Files used in this manual are located in C:\Program Files\InfoStat\Data.

New row
DATA menu → NEW ROW adds the number of rows specified by the user in the pop-up
window to the end of the table. Alternatively, the user can position the cursor on the last row
and press <Enter> to generate new rows.

Insert row

DATA menu → INSERT ROW inserts a new row above the selected row.

Delete row
DATA menu → DELETE ROWS eliminates the selected row(s) from the table. This action
can be undone by using the Undo submenu from the Edit menu.

Deactivate case
DATA menu → DEACTIVATE CASE allows the user to exclude selected rows from the
procedure to be executed. To deactivate a row in the table, the user should double click on
the case number. Deactivated observations show their case number inside parentheses and
the corresponding row is colored.


Activate case
DATA menu → ACTIVATE CASE activates cases that have been deactivated (i.e.,
activated cases participate in the analysis). To activate a single row, the user should double
click on its case number. To simultaneously activate several cases, the user should select a
cell from each row to be activated and activate them from the DATA menu or from the
menu that appears by right clicking the mouse. All selected cases are activated by default.

Invert selection
DATA menu → INVERT SELECTION activates (deactivates) cases that are deactivated
(activated).

Choosing cases
DATA menu → SELECT CASES... allows the user to establish criteria for selecting cases.
Once the action is executed, unselected cases are deactivated. First, the user should establish
to which variables the selection criteria will be applied, then specify the criteria.

In the Select cases dialogue window, a list of variables from the active table appears. From
this list, the user should select the variables to which the selection criteria will be applied,
entering these in the corresponding box on the Variables tab (a partition can be indicated in
the corresponding tab).
Procedures that facilitate the selection of variables are available when many variables are
used. At the foot of the list of variables, there are options to select variables according to a
particular common characteristic in their names. If the variables share a specific character or
sequence of characters, they can be selected simultaneously. The figure illustrates the
selection of all the variables whose names contain the letter P, once the option () box has
been activated. To specify that the character or character sequence is at the beginning of
the label, activate the option [) box; to indicate that it is at the end of the label, activate
the option (] box. Wildcard characters can also be used. For example, by entering the
sequence **1, all variables whose labels have two characters before the number 1 will be
selected from the list. If ??1 is entered, all variables whose labels contain a 1 preceded
by two alphabetical characters will be selected, and if ##1 is entered, all variables whose
labels contain a 1 preceded by two numerical characters will be selected.
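
The wildcard rules above amount to: * matches any single character, ? a single letter, and # a single digit. As an illustration only (this is not InfoStat code, the column labels are hypothetical, and it is assumed the pattern must match the whole label), a small Python sketch of the same matching logic is:

import re

def wildcard_to_regex(pattern):
    # '*' matches any single character, '?' a single letter, '#' a single digit
    mapping = {"*": ".", "?": "[A-Za-z]", "#": "[0-9]"}
    return "".join(mapping.get(ch, re.escape(ch)) for ch in pattern)

columns = ["PG1", "NP1", "X21", "ab1", "P2"]         # hypothetical variable labels
pattern = re.compile(wildcard_to_regex("**1"))        # two characters before a 1
print([c for c in columns if pattern.fullmatch(c)])   # ['PG1', 'NP1', 'X21', 'ab1']
print([c for c in columns
       if re.fullmatch(wildcard_to_regex("??1"), c)]) # ['PG1', 'NP1', 'ab1']
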
If groups have been formed (using the Variable manager window), the box labeled {g}
becomes available. By activating this box, a field that contains the list of available groups
appears, from which the groups can be selected.
Another way to select variables is to use a list saved in a text file. In so doing, all the
variables contained in the file will be selected. In order to do so, the user should right click
on the box that contains the list of variables of the active table. A menu appears in which the
Select from a list option appears, followed by the Text file option. In this same menu, there
is an option for alphabetically ordering the list of variables.

Once the variables have been selected, criteria for selecting the cases should be established.
The variables that participate in the selection process appear in the dialogue box, and there
is a field for writing the criteria. When a criterion is based on more than
one variable, the user should select one of the variables, write the expression that indicates
the criterion, for example x<80, and then press Enter. The user should proceed in the same
way with each variable of interest. By pressing Accept, the cases outside of the selection
appear deactivated (colored and with their case number in parentheses) in the active table.

More than one expression can be written to determine the criterion for a single variable. This
can be done by pressing Enter after writing each expression.
By activating the Create new table using active cases option, a table containing the selected
cases is generated.

New column
DATA menu → NEW COLUMN adds a new column to the end of the table. The data type
can be indicated (integer, real, categorical, or date). The added columns are named
Column 1, Column 2, etc. By pressing the button with an image of a table, located on the
toolbar, new columns are added to the right-hand side of the active table. Columns generated
in this way are not assigned a type beforehand. The type of data in these columns is assigned
automatically when content is entered in any of their cells: if the content is numerical, the
type assigned is real; if it is alphanumeric, the assigned type is categorical. If the user wishes
the type to be integer, he should change it afterwards, starting from a column with real type
data.

Insert column
DATA menu → INSERT COLUMN inserts a column before the column where the cursor is
located. The data type (real, integer, categorical or date) can be indicated. Inserted columns
will be named Column 1, Column 2, etc.

Delete column
DATA menu → DELETE COLUMN eliminates the selected column(s). The user need only
select one cell of each column. This action can be reversed by using the Undo submenu in
the Edit menu.

Note: to change the position of a column, select the column while pressing <Ctrl> and move the
mouse, while continuing to press down on the mouse button, to the new desired position. Upon
releasing the mouse button, the column will remain in its new location.

Edit Labels
DATA menu → EDIT LABELS allows the user to change the name of a column. The user
need only position the mouse on a cell of the column to be edited and invoke this
action. Acceptable names may include spaces and ASCII characters, with a limit of twenty
characters. If the name begins with a number, InfoStat will add the letter C in front of it. By
selecting several columns and applying this action, a dialogue window that allows the user
to change the column names successively appears.
In files generated with an IDB2 extension, double clicking on the edit field where the name
of the variable is written makes a dialogue appear that allows the user to write a description
of the variable. If the user wishes to include the description in the file, the description should
be saved.

Read labels from


DATA menu → READ LABELS FROM... allows the user to read the names of variables in
an active table from a text file (*.txt). InfoStat assumes that the names are in a list (one
name beneath the other) in the order in which the variables are found in the table.

Data type
DATA menu → DATA TYPE allows the user to declare the type of data in a column. The
following data types are acceptable: integer, real, categorical, and date. Dates can be entered
in the following formats: 20/05/07, 20-05-07 or 20.5.07.
If the user does not declare a data type, InfoStat assigns the type that corresponds to the first
value entered. Once the type has been declared, only data of the same type can be entered.

Alignment
DATA menu → ALIGNMENT changes the position of the presentation of the content in the
selected cells. Alignment positions include left, center and right. The default alignment for
numerical cells is right, and for categorical cells the default is left. There are also buttons to
complete the alignment action, found on the tool bar next to the A button.

Decimals
DATA menu → DECIMALS changes the number of decimal places shown for the
numerical content of the cells. Up to 10 decimal places are allowed. By default, 2 decimal
places are shown. When data are copied from the grid, only the visible decimals are taken
into account, so it is important to specify the desired number of decimals for each variable.

Automatically adjust columns


DATA menu → ADJUST COLUMN WIDTH (<Ctrl+L>) adjusts the width of the selected
columns according to the length of the column labels or the cell content. If no column is
selected, the action will be applied to all the columns of the table.

Sort
DATA menu → SORT allows the user to sort records in ascending or descending order of
the values in one or more columns. A dialogue window shows the names of the columns of
the active table in a list on the left. On the right, two lists, ascending order and descending
order, show the variables to be sorted according to the hierarchy determined by the user and
the order in which the variables were selected. For example, if the file has two columns,
gender and age, where the gender variable is placed first in the ascending order group and the age
variable is placed second in the descending order group, then after sorting, the file will be
ordered by gender and, within each gender, it will be sorted in descending order by age.
The buttons found on the lower part of the dialogue window allow the user to change the
sorting criteria (ascending or descending) and the sorting hierarchy.
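
For readers who also work outside InfoStat, the gender/age example above can be mimicked with a few lines of Python/pandas (hypothetical data; this sketch is not part of InfoStat):

import pandas as pd

# Hypothetical data reproducing the gender/age example above
df = pd.DataFrame({"gender": ["F", "M", "F", "M"],
                   "age":    [34, 28, 51, 45]})

# gender in the ascending order group, age in the descending order group
print(df.sort_values(by=["gender", "age"], ascending=[True, False]))
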
For example, using data from the Atriplex file, observations were sorted in descending
order, according to the values of the variable PG. The resulting configuration is shown in
the following table:

Table 1: Atriplex file sorted in descending order by variable PG.


Size      Color      Germination   Normal seedlings   DW
medium    reddish    100           80                 0.0032
big       yellow     93            80                 0.0040
medium    yellow     93            80                 0.0038
medium    yellow     93            80                 0.0043
small     reddish    93            7                  0.0030
big       yellow     87            87                 0.0043
medium    yellow     87            54                 0.0033
...       ...        ...           ...                ...
small     dark       20            0                  0.0030
medium    dark       13            7                  0.0030

Alternatively, sorting can be invoked from the toolbar by activating the Sort icon.
Warning: this option cannot be automatically undone. To keep the original file, close the table
without saving changes, save the file with another name, or sort in such a way as to recover the
original order of the data.


Categorize
DATA menu → CATEGORIZE allows the user to categorize data from a previously
selected column, generating a new column with the desired categorization. This action
is available only when the data in the selected column are integer or real. Two procedures
are available: assign categories to intervals or assign categories to numeric codes.
By selecting assign categories to intervals, categories are made by setting the upper limits
of a group of class intervals. Cases that belong to the same class are assigned to the same
category. The following categorization methods are defined, depending on the way in which
class intervals are established:
FIXED: categorizes a data group, generating as many intervals as requested categories.
The minimum and maximum values, the interval length, and the upper limits for each
category are shown, identified as C1, C2, etc. If the user wishes to identify each category
with whole numbers, he should activate the Numbers box. By default the categories are
sorted in ascending order; to change this, the Descending box should be activated.
To execute the categorization, press the Accept button. The user can change the Minimum
and Maximum values to obtain the desired categorization.
PROBABILISTIC: the upper limit of each category represents a percentile of the
distribution of the variable, according to the number of intervals requested. For example, if
4 intervals are requested, their respective limits are the 25th, 50th, 75th and 100th
percentiles. To apply the categorization, press the Accept button.
CUSTOMIZED: the upper limit of the intervals of each category can be entered. To do so,
the user should select the number of categories that he wishes to create and enter the upper
limit of each interval in the adjacent table. By default, the upper limit of the last category is
the maximum value of the observed values. To apply the categorization, press the Accept
button.
As an example, using data from the Atriplex file (previously sorted by the variable PG, in
descending order), observations were categorized by intervals. The resulting configuration is
shown in Table 2. Using the FIXED option, the pre-established configuration was selected:
N categories: 5; min: 13; max: 100; length of interval: 17.4; upper interval limits: 30.4;
47.8; 65.2; 82.6; 100. Using the PROBABILISTIC option, 5 categories were selected with
the following upper limits: 33; 60; 80; 87 and 100. Using the CUSTOMIZED option, two
categories were selected: one with germination values less than or equal to 80%, specified
by writing the number 80 in the LS1 field, and another with values greater than 80%,
specified in the LS2 field where the number 100 appears by default.

Table 2: Atriplex file with the variable PG categorized according to three criteria.

Germ.    Fixed   Prob.   Cust.
100.00   C5      C5      C2
93.00    C5      C5      C2
93.00    C5      C5      C2
93.00    C5      C5      C2
93.00    C5      C5      C2
87.00    C5      C4      C2
87.00    C5      C4      C2
87.00    C5      C4      C2
87.00    C5      C4      C2
87.00    C5      C4      C2
80.00    C4      C3      C1
80.00    C4      C3      C1
80.00    C4      C3      C1
73.00    C4      C3      C1
73.00    C4      C3      C1
66.00    C4      C3      C1
60.00    C3      C2      C1
60.00    C3      C2      C1
53.00    C3      C2      C1
53.00    C3      C2      C1
40.00    C2      C2      C1
33.00    C2      C1      C1
33.00    C2      C1      C1
26.00    C1      C1      C1
20.00    C1      C1      C1
20.00    C1      C1      C1
13.00    C1      C1      C1
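
As an aside for readers familiar with Python, the three interval-based criteria above can be mimicked with pandas. This sketch is not InfoStat code; the labels and functions are used only to reproduce the example, and small boundary differences with InfoStat's own percentile computation are possible:

import pandas as pd

# Germination values from the example above
pg = pd.Series([100, 93, 93, 93, 93, 87, 87, 87, 87, 87, 80, 80, 80, 73,
                73, 66, 60, 60, 53, 53, 40, 33, 33, 26, 20, 20, 13], name="PG")
labels = ["C1", "C2", "C3", "C4", "C5"]

# FIXED: five intervals of equal length between the minimum (13) and maximum (100)
fixed = pd.cut(pg, bins=5, labels=labels)

# PROBABILISTIC: interval limits at the 20th, 40th, 60th, 80th and 100th percentiles
prob = pd.qcut(pg, q=5, labels=labels)

# CUSTOMIZED: one category up to 80% germination and another above 80%
cust = pd.cut(pg, bins=[pg.min() - 1, 80, 100], labels=["C1", "C2"])

print(pd.DataFrame({"PG": pg, "Fixed": fixed, "Prob": prob, "Cust": cust}))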

Upon selecting assign categories to numeric codes, the categories can be read from a table
or entered by the user. This procedure is useful, for example, in the case of a file that uses
numeric coding to represent the different states of qualitative variables. The corresponding
dialogue window is shown below.
In the dialogue window, the list of numbers to be
categorized appears on the left, and on the right appears
an empty list of categories. The categories can be
entered manually or read from a text file or from a table
stored on the clipboard. The text file should contain as
many lines as categories, and each line should have a
number followed by a separator symbol (this can be =, :,
. or a tab), followed by the name of the category
associated with this number. For example, if upon
registering the type of occupation the number 2
corresponds to the category unemployed, this should
appear as follows: 2=unemployed. If the option for
assigning categories based on a table stored on the
clipboard is selected, that table should have been
previously copied from a file with the structure described
above for text files. These loading options are selected from a menu that appears by right
clicking the mouse on the assignment table, as shown in the figure. In order for the option
Copy from clipboard to appear, the table must be on the clipboard.
To obtain the categorization, press the Accept button. Categories will appear in a new
column labeled with the prefix Cat_ followed by the name of the variable that
corresponds to the categorization. The figure shows an edit field in which Cat_Occupation
appears, which can be modified by writing a new name.
When a numerical variable is categorized using an assignment table, the table can be read
from the description of the resulting variable.

Edit categories
To apply this action, the column that contains the categories should be selected. DATA
menu → EDIT CATEGORIES makes a dialogue window (Edit categories) appear that
shows the categories of the selected variable (column). In this window, a list of the existing
categories will appear. Upon selecting a category, its name will appear in an editing field
located above the list. In that field, the name of the category can be modified. The change
is automatically shown in the list. By pressing the Accept button, changes will be reflected in
the data table.
A category can be grouped with another one by using the arrow buttons: Upper limit and
Lower limit. If a category is selected and the right arrow button is pressed (Lower limit),
the selected category will be included within the category that precedes it on the list. Upon
pressing the Accept button, the included category will disappear from the data table and it
will be replaced by the category in which it is included. Another way to include a category
within another is to select it with the mouse, and while keeping the mouse button pressed,
drag it to the category that is to contain it. If a category is incorrectly placed within another
category, the user can re-locate it by dragging it to the category where he wishes it to be
included. Before pressing the Accept button, the action can be reversed by selecting the
included category and pressing the left arrow button (Upper limit). To change the position
of the categories, the up () and down () arrows can be used. Once the user is satisfied
with the categorization, he should press the Accept button so that the changes are reflected
in the data sheet.
In order to facilitate entering data for categorical variables, each category is associated with
a number that depends on its position in the list that appears in the Edit categories dialogue
window. For example, if the categories are small, medium and large and they appear
in the list in that order, by entering 1 in one of the cells of the column that contains these
categories and pressing <Enter>, the name small will appear. If the order of the categories
in the list is altered, the numeric coding will respond to the new order.
If a variable is changed from categorical to integer, numbers that correspond to the order of
the categories in the list will be generated.
The button shown in this paragraph, found on InfoStat's toolbar, allows
the user to edit categories without going to the DATA menu.

Transforming
By invoking this action, the Transformations window will appear, so that the user can
select the variable(s) he wishes to transform. These should be quantitative variables. Upon
pressing the Accept button, another window that allows the user to select the transformation
appears. In this window, two lists of transformations appear: one to be applied to a variable
and another to be applied to a combination of variables. Regardless of which transformation
is selected, InfoStat generates new columns containing the transformed variables, which will
automatically be named with the name of the transformation followed by an underscore and
the name of the original variable.
Selecting the transformation: possible transformations include the following:
Standardize, Standardize (by row), Center, Center (by row), Externally studt res
(externally studentized residuals), Rank, Normal score, Log10 (base 10 logarithm), Log2
(base 2 logarithm), Ln (natural logarithm), Square root, Inverse, Power, ArcSin (square
root (p)), Probit, Logit, Complement log-log, Map to [0,1], If >= mean then 1 else 0, If
>= median then 1 else 0, Multiply by, and Scale by the maximum. If two or more
variables are selected, other transformations that appear in the Combining variables list can
be executed.
Standardize: allows the user to standardize the selected variable(s). The standardization is
done by subtracting from each observation the mean of the column and dividing the result
by the standard deviation of the values of the column.
Standardize (by row): if the user selects more than one variable in the transformations
menu, the standardize by row option becomes enabled. In such cases, each entry in the
table is transformed to its standardized value using the mean and standard deviation of the
elements in the corresponding row.
Center: this transformation centers by column. In other words, from each observation,
InfoStat subtracts the mean value of the variable, obtained using data from the
corresponding column.
Center (by row): in this case, from each value of a selected variable, InfoStat subtracts the
row mean, obtained using data for all the selected variables.
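
A hedged illustration of these four variants, not InfoStat code, using Python/pandas and the sample standard deviation:

import pandas as pd

df = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [10.0, 20.0, 60.0]})

# Standardize (by column): subtract the column mean and divide by the column SD
std_col = (df - df.mean()) / df.std()

# Standardize (by row): use the mean and SD of the values in each row instead
std_row = df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)

# Center and Center (by row): subtract the corresponding mean only
center_col = df - df.mean()
center_row = df.sub(df.mean(axis=1), axis=0)
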
Externally studt res (externally studentized residuals): for a location model, this is defined as

ESR = (yi − ȳ(−i)) / S(−i)

where yi is the value of the discarded observation, ȳ(−i) is the mean of the data without the
observation yi, and S(−i) is the standard deviation of the data calculated after the observation
is discarded.
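
A small sketch of this leave-one-out calculation (an illustration only, not InfoStat's implementation):

import numpy as np

def externally_studentized(y):
    # For each observation: subtract the mean and divide by the standard
    # deviation, both computed with that observation left out.
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    for i in range(len(y)):
        rest = np.delete(y, i)
        out[i] = (y[i] - rest.mean()) / rest.std(ddof=1)
    return out

print(externally_studentized([10, 12, 11, 30]))  # the value 30 stands out
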
Rank: this function assigns to each original datum the position it occupies in the ascending
ordering of the data. In a group of n data, the observation with the lowest value is assigned
rank 1, the one with the second lowest value is assigned rank 2, and so on; the observation
with the highest value is assigned rank n. If two or more observations share a single
value (a tie), the rank assigned to each of them is the average of the consecutive ranks
corresponding to that value.
For example, for the series 10, 20, 20, 30, 40, 50, 50, 50, 60 the transformed series is as
follows: 1, 2.5, 2.5, 4, 5, 7, 7, 7, 9.
Normal score: the Rank transformation is applied to the selected variable. Next, each
rank value is divided by (n+1), where n is the total number of data in the sample. For each
quotient, the inverse of the Normal (0,1) distribution function is obtained.
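
Both transformations can be checked with a few lines of Python; scipy's rankdata uses the same average-rank rule for ties, and the series is the worked example given above (illustrative sketch only):

import numpy as np
from scipy.stats import rankdata, norm

y = np.array([10, 20, 20, 30, 40, 50, 50, 50, 60])

ranks = rankdata(y)               # ties receive the average of their ranks
print(ranks)                      # [1.  2.5 2.5 4.  5.  7.  7.  7.  9.]

normal_score = norm.ppf(ranks / (len(y) + 1))   # inverse Normal(0,1) of rank/(n+1)
print(np.round(normal_score, 3))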

Logarithm transformation: InfoStat allows users to generate variables using Log10
(base 10 logarithm), Log2 (base 2 logarithm) and Ln (natural logarithm). If the value to be
transformed is less than or equal to zero, the result will be a missing value. In this case,
log(y+c) can be used, where c is a constant.
Square root: √y or √(y + c), where c is a constant.

Inverse: 1/y.
Power: y^α with α ≠ 0, where α is the desired power.
ArcSin (square root (p)): sin⁻¹(√p) with p ∈ [0,1] (arcsine of the square root of the
proportion).
Probit: defined as Probit(p) = Φ⁻¹(p) with p ∈ (0,1), where Φ⁻¹ is the inverse of the normal
distribution function.

Logit: defined as Logit(p) = ln(p/(1−p)) with p ∈ (0,1).


Complement log-log: defined as CLL(p) = ln[−ln(1−p)] with p ∈ (0,1).
Map to [0,1]: given a group of observations {y1,...,yn}, the transformation consists of
subtracting from each value the minimum of {y1,...,yn} and dividing the result by the
range (the difference between the maximum and the minimum).
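
The proportion-based transformations above, together with Map to [0,1], reduce to one-liners; a hedged numpy/scipy sketch (illustrative data, not InfoStat code):

import numpy as np
from scipy.stats import norm

p = np.array([0.05, 0.25, 0.50, 0.75, 0.95])

arcsin_sqrt = np.arcsin(np.sqrt(p))    # ArcSin(square root(p)), p in [0,1]
probit      = norm.ppf(p)              # inverse normal distribution function, p in (0,1)
logit       = np.log(p / (1 - p))      # ln(p/(1-p)), p in (0,1)
cll         = np.log(-np.log(1 - p))   # complement log-log, p in (0,1)

y = np.array([13.0, 40.0, 60.0, 87.0, 100.0])
map01 = (y - y.min()) / (y.max() - y.min())   # Map to [0,1]
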
If >= mean then 1 else 0: allows the user to dichotomize the data as a function of the mean
of the observations. Observations greater than or equal to the mean take the value 1.
If >= median then 1 else 0: allows the user to dichotomize the data as a function of the
median of the observations. Observations greater than or equal to the median take the
value 1.
Accumulate: generates a column where the t-th element represents the sum of the first t
elements. For example, if the column contains values 10, 12 and 20, applying this action
will generate the values 10, 22 and 42.
Scale by the maximum: divides each selected column by its maximum.
Divide by the sum: divides each selected column by its sum.
Fill with a sequence: replaces the non-missing values of the selected columns with a
sequence in the order of the records.
Combining variables allows the user to apply functions that involve several columns of the
file. The variables that are to take part in the evaluation of the selected function should be
specified in the variables selector. The selected function can be one of the following: Sum,
Mean, Median, Variance, Standard deviation, Minimum, Maximum, and Linear
combination. The Sum function sums the values of the selected columns in each row of the
file and generates a new variable named Sum. Similarly, the Mean, Median, Variance,
Standard deviation, Minimum, and Maximum of the values in each row can be requested.


When Linear combination is selected, the coefficients of the combination should be
indicated in the Coefficients window. The coefficients should be entered one by one,
pressing <Enter> after each entry. Thus, if there are two columns, say X and Y, and the
numbers 2 and 3 are specified in the coefficients window, a new column called linear
combination, equal to 2X+3Y, will be generated.
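A minimal sketch (made-up columns X and Y, outside InfoStat) of the row-wise combinations described above, including the linear combination 2X+3Y:

# Minimal sketch of combining variables row by row.
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([4.0, 5.0, 6.0])

row_sum = X + Y            # Sum function
row_mean = (X + Y) / 2     # Mean function
lin_comb = 2 * X + 3 * Y   # Linear combination with coefficients 2 and 3
print(row_sum, row_mean, lin_comb)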

Create dummy variables

In some statistical applications, for example those associated with regression models, it is
necessary to transform a categorical variable X with k categories into k−1 binary variables
(with value 0 or 1). A binary variable of this type is known as a dummy (auxiliary or
indicator) variable. The group of k−1 dummy variables is used to identify each of the
categories of the original variable X. Thus, if, for example, X has k=3 categories, two
dummy variables D1 and D2 will be enough to represent each of the categories of X. For
example, the combination D1=1 and D2=0 can identify the first category; D1=0 and D2=1
can identify the second category; and D1=0 and D2=0 can identify the third category. In this
case, the third category (the one in which all dummy variables equal zero) is generally
called the reference category.
To generate dummy variables, select the original categorical variable(s); upon pressing
Accept, a Dummy variable generator will appear on the screen in which the original
variable(s) and the available categories for each one are listed. The first category is
automatically selected to be used as the reference category. If the user wishes another
category to serve this purpose, he should move the cursor to the desired category in order to
select it. InfoStat generates k−1 dummy variables, which are added to the data table and
identified by the original variable name followed by an extension, so that they may be
differentiated.
The option Multiply by which appears in the Dummy variables generator screen can be
used to obtain the product of a dummy variable and some other variable of interest. These
products will be shown in new columns in the data table, with a name that indicates their
origin. An example of the application of this option is available in Regression with dummy
variables.
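As a rough illustration of the same idea outside InfoStat (the column name Size and its categories are hypothetical), pandas can build k−1 dummy variables and drop the reference category:

# Minimal sketch of generating k-1 dummy variables with a reference category.
import pandas as pd

df = pd.DataFrame({"Size": ["small", "big", "medium", "small", "big"]})

dummies = pd.get_dummies(df["Size"], prefix="Size", dtype=int)
dummies = dummies.drop(columns="Size_small")   # 'small' is the reference (all zeros)
print(pd.concat([df, dummies], axis=1))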

Fill...
This option automatically fills a group of selected cells according to the specified option. To
fill cells, select the desired cell(s) and specify the distribution from the DATA FILL...
menu.
Warning: these actions replace the values of the selected content, thus if the user wishes to preserve
the content of the original column, he should copy the column and apply the distribution to the new
column.

Downward
The empty cells are filled with the content of the first filled cell that precedes the empty
cells in that same column. This action can also be completed by pressing CTRL+D.

With sequence
Beginning with the first selected cell, selected cells are assigned a natural number, in
ascending order. The numbering continues on the columns on the right, and does not re-start
with the new column.

With uniform (0,1)

Upon selecting this option, the selected cells are assigned the value of a continuous random
variable with a uniform distribution between 0 and 1.

With Standard normal (0,1)

Upon selecting this option, the selected cells are assigned the value of a random variable
with a standard normal distribution (mean = 0 and variance = 1).

Others...
In order to generate an ample list of distributions of random variables, InfoStat allows the
user to fill cells with the following: 1) realizations of the random variable, 2) a cumulative
distribution function for arguments read from the selected cells, 3) an inverse distribution
function evaluated according to the selected values, and 4) a probability function
evaluated according to the selected values.
The following distributions are available: Uniform, Normal, Student T, Chi square,
Non-central F, Exponential, Gamma, Weibull, Logistic, Gumbel, Poisson, Binomial,
Geometric, Hypergeometric and Negative binomial.
The Sequence (begin, step) option is also available, and it can be used to fill cells with a
sequence of real numbers, where the user defines the beginning and the distance between
two consecutive numbers in the Parameters (begin and step) subwindow that is activated
upon selecting Sequence (begin, step). For example, if the
beginning number is 1 and the step is 2, the selected column will begin with 1 and will
continue with 3, then with 5, and so on.
To fill cells with realizations of the random variable, cumulative distribution function,
inverse distribution function, or probability distribution of one of the available random
variables, select the random variable and in the Parameters panel, specify the constants that
characterize the selected distribution.
Select seed: By default, InfoStat uses a random seed to generate random numbers; however,
in some cases it is useful to reproduce the same random sequence. This can be done by
specifying any number other than zero in the edit field that is activated when the Select
seed button is pressed. If the number zero is specified as the seed, InfoStat assumes that the
seed is random, and therefore the sequences will always be different.
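The effect of fixing a seed can be sketched as follows (NumPy is used here only to illustrate the general idea; it is not InfoStat's generator):

# Minimal sketch: the same nonzero seed reproduces the same random sequence.
import numpy as np

rng1 = np.random.default_rng(12345)
rng2 = np.random.default_rng(12345)
print(np.allclose(rng1.uniform(size=5), rng2.uniform(size=5)))  # True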
A brief description of the available distributions is shown below:
Note: E(X) and V(X) indicate the expected value and the variance of the random variable (X),
respectively.

Uniform (a,b): A continuous random variable X has a Uniform distribution on the interval
[a,b] if its density function is as follows:

f(x; a, b) = \frac{1}{b-a}\, I_{[a,b]}(x)

where I_{[a,b]}(x) is the indicator function, and the parameters a and b satisfy −∞ < a < b < ∞.
E(X) = (a+b)/2 and Var(X) = (b−a)²/12.
Normal (mean, variance): A continuous random variable X, with −∞ < x < ∞, has a Normal
distribution if its density function is as follows:

f(x; m, v) = \frac{1}{\sqrt{2\pi v}}\, e^{-(x-m)^2/(2v)}

where the parameters m (mean) and v (variance) satisfy −∞ < m < ∞ and v > 0. InfoStat uses m
and v to represent the parameters E(X) = μ and Var(X) = σ², respectively.
Student-T (v): The continuous random variable X (with −∞ < x < ∞) has a Student-T
distribution with v degrees of freedom if its density function is as follows:

f(x; \nu) = \frac{\Gamma\left((\nu+1)/2\right)}{\Gamma(\nu/2)\sqrt{\nu\pi}} \left(1 + x^2/\nu\right)^{-(\nu+1)/2}

where ν is a whole positive number known as the degrees of freedom, and Γ(·) is the gamma
function with the following form:

\Gamma(r) = \int_0^{\infty} y^{r-1} e^{-y}\, dy

E(X) = 0 for degrees of freedom greater than 1, and V(X) = ν/(ν−2) for ν > 2.
Chi square (v, lambda) (non-central): The random variable X has a Chi square distribution
if its density function is as follows:

f(x; \nu, \lambda) = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\,\lambda^{j}}{j!}\; \frac{x^{(\nu+2j-2)/2}\, e^{-x/2}}{2^{(\nu+2j)/2}\,\Gamma\left((\nu+2j)/2\right)}\; I_{(0,\infty)}(x)

where I_{(0,∞)}(x) is the indicator function, ν is a whole positive number that denotes the degrees
of freedom, Γ(·) is the gamma function, and λ ≥ 0 is known as the non-centrality parameter,
with the convention that λ^j = 1 when λ = 0 and j = 0.
E(X) = ν + 2λ and V(X) = 2(ν + 4λ). If λ = 0, the distribution is the central Chi square.
F non-central (u, v, lambda): The continuous random variable X has a non-central F
distribution, characterized by degrees of freedom u (degrees of freedom of the numerator)
and v (degrees of freedom of the denominator), and by the non-centrality parameter λ, if its
density function is as follows:

f(x; u, \nu, \lambda) = \sum_{j=0}^{\infty} \frac{\lambda^{j} e^{-\lambda}}{j!}\, \frac{\Gamma\left((2j+u+\nu)/2\right)}{\Gamma\left((2j+u)/2\right)\Gamma(\nu/2)} \left(\frac{u}{\nu}\right)^{(u+2j)/2} x^{(u+2j-2)/2} \left(1 + \frac{u x}{\nu}\right)^{-(u+\nu+2j)/2} I_{(0,\infty)}(x)

where I_{(0,∞)}(x) is the indicator function, u and ν are whole positive numbers, Γ(·) is the
gamma function, and λ ≥ 0, with the convention that λ^j = 1 when λ = 0 and j = 0. If λ = 0, the
distribution is the central F with E(X) = v/(v−2) for v > 2 and
V(X) = 2v²(u+v−2)/(u(v−2)²(v−4)) for v > 4.
Exponential (lambda): The continuous random variable X has an Exponential distribution
if its density function is as follows:

f(x; \lambda) = \lambda\, e^{-\lambda x}\, I_{(0,\infty)}(x)

where I_{(0,∞)}(x) is the indicator function and λ > 0. E(X) = 1/λ and V(X) = 1/λ².
Gamma (r, lambda): The continuous random variable X has a Gamma distribution if its
density function is as follows:

f(x; r, \lambda) = \frac{\lambda^{r}}{\Gamma(r)}\, x^{r-1} e^{-\lambda x}\, I_{(0,\infty)}(x)

where I_{(0,∞)}(x) is the indicator function, r > 0 and λ > 0, and Γ(·) is the gamma function.
E(X) = r/λ and V(X) = r/λ².
Beta (a, b): The continuous random variable X has a Beta distribution if its density function
is as follows:

f(x; a, b) = \frac{1}{B(a,b)}\, x^{a-1} (1-x)^{b-1}\, I_{(0,1)}(x)

where I_{(0,1)}(x) is the indicator function, a > 0, b > 0, and B(a,b) is the beta function given by
the following expression:

B(a, b) = \int_0^{1} x^{a-1} (1-x)^{b-1}\, dx, \quad \text{for } a > 0,\ b > 0

E(X) = a/(a+b) and V(X) = ab/((a+b+1)(a+b)²).


Weibull (a, b): The continuous random variable X has a Weibull distribution if its density
function is as follows:

f(x; a, b) = a\, b\, x^{b-1} e^{-a x^{b}}\, I_{(0,\infty)}(x)

where I_{(0,∞)}(x) is the indicator function, a > 0 and b > 0. E(X) = (1/a)^{1/b} Γ(1 + b⁻¹) and
V(X) = (1/a)^{2/b} [Γ(1 + 2b⁻¹) − Γ²(1 + b⁻¹)], where Γ(·) is the gamma function.
Logistic (a,b): The continuous random variable X has a Logistic distribution if its cumulative
distribution function is as follows:

F(x; a, b) = \left[1 + e^{-(x-a)/b}\right]^{-1}

where −∞ < a < ∞ and b > 0. E(X) = a and V(X) = (π²b²)/3.


Gumbel or extreme value (a,b): The continuous random variable X has a Gumbel
distribution if its cumulative distribution function is as follows:

F(x; a, b) = \exp\left(-e^{-(x-a)/b}\right)

where −∞ < a < ∞ and b > 0. E(X) = a + bγ, where γ ≈ 0.577216 is the Euler constant, and
V(X) = (π²b²)/6.


Poisson (lambda): This distribution provides a model for count-type variables where the
counts refer to the number of events of interest in a unit of time or space (hours, minutes,
m², m³, etc.). A discrete random variable X has a Poisson distribution if its density function
is as follows:

f(x; \lambda) = \frac{e^{-\lambda}\, \lambda^{x}}{x!}\, I_{[0,1,\ldots]}(x)

where I_{[0,1,...]}(x) is the indicator function and λ > 0. E(X) = λ and Var(X) = λ.

Binomial (n, p): This distribution arises when the following conditions are simultaneously
present: a) Bernoulli trials are executed, b) the parameter p (probability of success) is
constant between trials, and c) trials are independent of each other.
Bernoulli distribution: in some experiments there are only two possible results: success or failure,
presence or absence, yes or no, etc. A Bernoulli variable is a binary variable that identifies these
events. For example, x=1 may represent success and x=0 may represent failure. E(X)=p and
V(X)=p(1-p), where p is the probability of success.

A discrete random variable X is said to have a Binomial distribution if its density function is
as follows:

f(x; n, p) = \binom{n}{x} p^{x} q^{n-x}\, I_{[0,1,\ldots,n]}(x)

where I_{[0,1,...,n]}(x) is the indicator function, 0 ≤ p ≤ 1, q = 1−p and n = 1, 2, ... is the total number
of trials. E(X) = np and Var(X) = npq.
Geometric (p): This distribution is of special interest in modeling the number of trials
needed for the first success to occur. A discrete random variable X has a Geometric (or
Pascal) distribution if its density function is as follows:

f(x; p) = p\,(1-p)^{x}\, I_{[0,1,\ldots]}(x)

where I_{[0,1,...]}(x) is the indicator function, 0 ≤ p ≤ 1, and q = 1−p. E(X) = q/p and Var(X) = q/p².
Hypergeometric (m,k,n): This distribution is associated with situations in which there is
sampling without replacement; that is, situations in which an element of the population is
randomly selected, and so on, until the sample is complete, without substituting the
extracted elements. Let a population be a group of m elements, k of which are in one of two
possible states (success) and m−k of which are in the other state (failure). As in the
Binomial distribution, the problem of interest is to find the probability of obtaining x
successes in a sample of size n. A discrete random variable X has a Hypergeometric
distribution if its density function is as follows:

f(x; m, k, n) = \frac{\binom{k}{x}\binom{m-k}{n-x}}{\binom{m}{n}}\, I_{[0,1,\ldots,n]}(x)

where I_{[0,1,...,n]}(x) is the indicator function, m = 1, 2, ..., k = 0, 1, ..., m and n = 1, 2, ..., m.
E(X) = n(k/m) and Var(X) = n(k/m)((m−k)/m)((m−n)/(m−1)).
Negative binomial (m,k): As in the repetition of Bernoulli trials, certain problems, common
in studies of natural populations, concentrate on the probability of finding x individuals in a
sampling unit under study where the individuals tend to be aggregated (contagious
distribution). InfoStat allows the user to calculate those probabilities by means of the
Negative binomial distribution. A discrete random variable X has a Negative binomial
distribution if its density function is as follows:

f(x; m, k) = \frac{1}{q^{k}}\; \frac{k(k+1)(k+2)\cdots(k+x-1)}{x!} \left(\frac{p}{q}\right)^{x} I_{[0,1,\ldots]}(x)

where I_{[0,1,...]}(x) is the indicator function, p = m/k and q = p + 1. The parameters m and k
satisfy the following conditions: m > 0 (average number of individuals per sampling unit) and
k > 0 (contagion or aggregation parameter).
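For reference, a minimal SciPy sketch (not InfoStat) that evaluates probabilities under the Negative binomial (m, k) parameterization used above; scipy.stats.nbinom is parameterized here with n = k and prob = k/(m+k).

# Minimal sketch: Negative binomial (m, k) probabilities via scipy.stats.nbinom.
from scipy.stats import nbinom

m, k = 1.205, 1.109            # mean and aggregation parameter (example values)
n, prob = k, k / (m + k)       # SciPy parameterization

print(nbinom.pmf(0, n, prob))  # P(X = 0)
print(nbinom.mean(n, prob))    # equals m
print(nbinom.var(n, prob))     # equals m + m**2/k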

Formula
It is possible to specify a formula whose results can substitute the content of an existing
column or can be added as a new column.
Warning: the names of the variables used in the calculation should not contain parentheses,
mathematical operation symbols or names of reserved functions, but they can contain accent marks
and the letter ñ.

The dialogue window is shown below:

During a work session, the formulae are stored in a list as they are written, and they are thus
available for future use. To visualize them, the user should right click on the field in which
the formulae are written.
The dialogue window shows a list of available variables which can be included in a formula
by clicking on the name of the list. When this procedure is followed to add variables to the
expression that is being written, the names appear in quotes. This allows the user to include
names that contain spaces or mathematical symbols that should not be interpreted as such.

The user can either use predefined functions or define his own functions. In the
latter case, the user should write the function in the panel that appears below the formula
edition field. For example, the function cube(x) is not a predefined function, but it can be
specified by the user in the User defined functions panel by writing: cube(x)=x*x*x. This
definition will allow the user to apply the cube function to any other variable in the active
table or to any other valid expression. By writing in the formula specification field, for
example, h=cube(COLUMN2), the cube function will be applied to the data in column 2.
If the variables involved in the formula have a very long name, these names can be
substituted in the formula with %#, where # is the number of the column that holds the
variable. For example, if the data table has 3 columns, %1 denotes the name of the first
column, %2 denotes the name of the second column, and %3 denotes the name of the third
column. To identify the correspondence between column name and number, press the Alt
key. While this key is held, the names of the columns in the active table will be shown as
%#.
If the user wishes to apply a function such as mean(.), min(.), max(.), which accepts multiple
arguments, to a block of variables, he should use the notation f(%a:%b), where f denotes the
function, and %a and %b denote the column numbers of the beginning and end of the block,
respectively. Note that the character that separates the beginning and end of a block is the
colon (:). Continuing with the above example, in order to calculate the average of the
first 3 variables in the file, the following should be indicated: mean(%1:%3). Another way to
indicate that the function should be applied to a group of variables such as, for example,
mean(), is to use the format mean(name variable1: name variableN), indicating that the
mean of all the variables between the first and the N-th variable is desired. This expression can
be written manually or automatically, by selecting the block of variables in the list of
variables.
IDB2 data tables save the formulae that generate the contents of a column. It is possible to
update the content of a column by applying the formula again. To do so, the column should
be selected, and then the Update option should be chosen from the Data menu or from the
menu that appears upon right clicking on the mouse. The dialogue appears in Macros mode,
with the corresponding formula (or formulae, if more than one column was selected). These
formulae can be edited or executed, individually or jointly, to update column content.
Modifications can be conducted from the data table, while keeping the formulae window
open.
To specify a formula, select DATA FORMULA and write, for example, the expression
Y=LN(COLUMN1)+3 in the window.
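A rough pandas equivalent of that formula (illustrative only; it is not InfoStat's formula engine, and the column name is assumed):

# Minimal sketch of the formula Y = LN(COLUMN1) + 3 applied to a data table.
import numpy as np
import pandas as pd

df = pd.DataFrame({"COLUMN1": [1.0, 2.7, 7.4]})
df["Y"] = np.log(df["COLUMN1"]) + 3
print(df)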
The following operators and functions are predefined in InfoStat:
+ : addition operator.
- : subtraction operator.
* : multiplication operator.
/ : division operator.

^ : exponent operator (only positive numbers in the base).


( : open parentheses.
) : close parentheses.
e : constant 2.71828...
PI: constant 3.141592653
SETSEED(x): Use this sentence with any integer as argument to set the random seed to a
given initial value.
ABS(x) : absolute value of x (Range of x: -1e4932...1e4932).
ARCCOSINE(x) or ARCCOSIN(x): arccosine of x.
ARCSINE(x) or ARCSIN(x): arcsine of x.
AREAY(y1;;yn): Calculates the area under the curve defined by the ordered pairs (Y,X),
assuming that the values of X are equally spaced by one unit.
AREAYX(y1;x1;;yn;xn): Calculates the area under the curve defined by the ordered pairs
(Y,X).
ATAN(x): arctangent of x (Range of x: -1e4932...1e4932).
COSINE(x) or COS(x): cosine of x (Range of x: -1e18...1e18).
SQUARE(x) or SQR(x): square of x (Range of x: -1e2446... 1e2446).
STDEV(x1;x2;;xn): Calculates the standard deviation of the indicated variables.
DISTNORMAL(x;m;v): Calculates the cumulative probability up to x for a normal
distribution with mean m and variance v.
EXP(x): exponential e^x (Range of x: -11356...11356).
FACTORIAL(x): factorial of x.
GAMMA(x): Assigns values of the Gamma distribution to the values of the indicated
function.
INVNORMAL(p;m;v): Calculates the value of x such that P(X<x)=p with X~N(m,v).
LN(x): natural logarithm of x (Range of x: 0...1e4932).
LN2(x): base 2 logarithm of x.
LOG10(x): base 10 logarithm of x.
MAX(x1;x2;;xn): Calculates the maximum value of the indicated data group.
MEAN(x1;x2;;xn): Calculates the mean of the values of the indicated variables.
MEDIAN(x1;x2;;xn): Calculates the median of the values of the indicated variables.
MIN(x1;x2;;xn): Calculates the minimum value of the indicated data group.
MOD(x) : modulus (or remainder) operator (applicable only to whole numbers).
NORMA(x1;x2;;xn): Calculates the norm of the vector x.
NORMAL(m, v): Generates realizations of a normal random variable with mean m and
variance v.
ROUND(x): rounds x (Range of x: -1e9...1e9).
SQRT(x): square root of x (Range of x: 0...1e4932).
SINE(x) or SIN(x): sine of x (Range of x -1e18...1e18).
SUM(x1;x2;;xn): Sum of the values of the indicated variables.
TANGENT (x): Tangent of x.
TRUNC(x): takes the whole value of x (Range of x: -1e9... 1e9).
URN: Generates realizations of a random variable with uniform distribution.
UNIFORM(a, b): Generates realizations of a random variable with uniform distribution on
the interval (a, b).
VARIANCE(x1;x2;;xn): Calculates the variance of the values of the indicated variables.
ZRN: Generates realizations of a random variable with standard normal distribution.
To work with date type variables, the functions described below are available (the arguments
required by each function are in parentheses).
DIADELCICLO(date,day,month): this function generates a column that contains the day
of the cycle (on a scale of 1 to 365), according to the corresponding date and taking into
account that the cycle begins on the day and month specified in the argument. For example,
if in the formula field the user enters day=DIADELCICLO(date,1,9), a column named day
is generated that contains whole numbers between 1 and 365, each one corresponding to the
date indicated in the argument, where day 1 of the cycle is September 1st. Thus, according
to this example, if the date column reads 18/09/07, the day column will contain the whole
number 18; if the date column reads 03/10/07, the day column will contain the whole
number 33.
FECHADELDIADELCICLO(diadelciclo,day,month,year): returns the date that
corresponds to the specified day of the cycle, according to the day, month and year that
correspond to the date of origin of the cycle. If the year argument is omitted, the current
year is taken. This function is the inverse of the function DIADELCICLO.
DIAJULIANO(date): generates a column containing the Julian day that corresponds to each
value read from the date column.
YEAR(date): generates a column containing the year that corresponds to each value read
from the date column.
MONTH(date): generates a column containing the month that corresponds to each value
read from the date column.
DAY(date): generates a column containing the day of the month that corresponds to each
value read from the date column.

DATE(day, month, year): generates a column containing the date that corresponds to the
specified day, month and year.

Search
DATA menu SEARCH presents a dialogue window that allows the user to search, within a
part of the table that has been previously selected, for numbers, categories or dates equal to,
greater than, less than and/or different from a value specified by the user. These values can
be replaced by another value by activating the Replace box, excluded from the analysis by
activating the Deactivate case box, or the cells can be colored by activating the Color it
box. The search can be specified for the complete content of a cell (if the Whole cell box is
activated), or for certain elements within a text. After each replacement or deactivation,
the searcher reports the number of cases that were found or deactivated.

Resampling
DATA menu SAMPLING/RESAMPLING allows the user to obtain samples from a
group of data by using the bootstrap, jackknife, randomly with replacement, or randomly
without replacement methods. The bootstrap method conducts a random sampling with
replacement and generates samples of size n equal to the size of the original sample, while
the option randomly with replacement allows the user to generate samples of a size different
from n. The column from which the samples are to be drawn should be indicated, as well as
classification and/or partition criteria, if these exist. Then, the user should select a sampling
technique (in the Resampling method panel) and the values to be reported by the sampling
(Save panel). If bootstrap is selected, the number of samples to be extracted should be
entered in the Bootstrap field; if randomly with or without replacement is selected, the
number of samples to be generated should be indicated (in # of samples) as well as their
size (in Sample size). The user can save the values of the variable that make up each of the
requested samples (Samples option), as well as one or several summary statistics for each
sample (Mean, Median, Maximum, Minimum, Range, Variance, Standard deviation (S.D.),
Standard error, Coefficient of variation (C.V.), Sum, Sum of squares, Median absolute
deviation (MAD), Percentiles (P01, P05, P10, P20, P25, P50, P75, P80, P90, P95, P99),
Kurtosis and Skewness).
The results are shown in a new table. If the values of the variable are solicited, the new table
will have a column for each sample. If one or more summary statistics are solicited, the new
table will contain each sample and each measurement in a column.
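A minimal sketch of the bootstrap idea (hypothetical data; not InfoStat's implementation): each sample has the same size n as the original data, is drawn with replacement, and a summary statistic is kept for every sample.

# Minimal sketch of bootstrap resampling and the bootstrap standard error.
import numpy as np

rng = np.random.default_rng(1)
data = np.array([54.6, 24.4, 73.3, 51.3, 68.8, 50.7])   # made-up values
n_boot = 1000

boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])
print(boot_means.mean(), boot_means.std(ddof=1))   # bootstrap mean and its S.E.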

Color selection
DATA menu COLOR SELECTION allows the user to color a group of previously
selected cells. When a variable is colored, it appears with that color in the Variables selector
list. This characteristic is useful, for example, if colors are used to distinguish groups of
variables.

Merge tables
DATA menu MERGE TABLES allows the user to merge an active table to two or more
tables Horizontally or Vertically. The merge is done one table at a time.
A Horizontal merge adds columns to the active table to include the new information and
requires that the user select one or more merging criteria. Once these criteria are established,
a dialogue window will appear from which the table and columns to be merged (added) to
the active table should be selected. The window contains a list of tables open on the screen,
from which the table to be merged should be identified. If the desired table is not listed, the
Other table button should be pressed in order to open the corresponding table from its
location, and thus the table will be added to the list. Upon selecting a table from the list,
column (variable) names will appear with an activated check box, indicating which
variables will be added to the active table. The user can deactivate those which he does not
wish to participate in the process. In the case that both tables have the same column names,
upon adding the new information, InfoStat will place a number at the end of the name of the
added column in order to distinguish it from the other column with the same name. If the
user wishes to replace the content of the columns with the same name in the active table, he
should activate the Overwrite box.
Upon completing the horizontal merge, the solicited columns are added, but the information
from the original table is not included.
A vertical merge adds new rows to the active table in order to include the information
contained in coinciding columns, and creates new columns for variables that do not
coincide. The process is similar to the one described for a horizontal merge, except that
there is no need to specify merging criteria.

Rearrange columns, one under the other

DATA menu Rearrange columns, one under the other merges the content of two or
more columns into a single column. The columns to be merged should be selected in the
dialogue window (Columns option), and the merge will be conducted according to the
selection order. The user may also choose to copy the information from a column of interest
(Copy... option). There is an option to conduct the merge with only the active cases. By
clicking Go, a new table that shows the result of the union is generated.

Rearrange rows as columns


DATA menu REARRANGE ROWS AS COLUMNS allows the user to transfer the
content of the rows of an active table to the columns of a new table, according to the
classification criteria established by the user. In the Columns option of the dialogue
window, the user should indicate the variables whose data will appear in the columns of the
new table, and in the Partition criteria option, he should indicate those variables which will
define the columns of the new table. The user may also copy entries of a particular column
of interest (Copy... option). The new table will appear upon clicking OK.

Create a new table using active cases

DATA menu CREATE NEW TABLE USING ACTIVE CASES generates a new table
that will contain only the active cases of an open table that also contains inactive cases.

Merge categories
DATA menu MAKE A NEW COLUMN BY MERGING CATEGORICAL
VARIABLES allows the user to obtain the combinations that result from merging the
categories of two or more variables. In the dialogue window, under the Partition criteria
option, the user should indicate the variables he wishes to cross. Upon clicking OK, a new
column with the classes obtained by the merge will appear in the table.

Output
The OUTPUT menu shows the actions that can be applied to an active result (the last result
of an action requested from the Statistics or Applications menu). In order to activate another
previously obtained result, click on the tab that indexes that result, found at the foot of the
RESULTS window. Upon activating the OUTPUT menu, the user will be able to choose
from among the following options:

Upload results
This allows the user to open a file that contains results that have been saved during a work
session. The file name and location are specified in the dialogue window.

Save results
This allows the user to create a file containing results that have been obtained during a work
session. The file name and location are specified in the dialogue window. The files will have
a .ITRES extension.

Decimals
This item displays a submenu that allows the user to select the desired number of decimals
to be shown. At the bottom of this menu, an option for exponential notation appears; in the
case that a result cannot show any significant digit with the specified number of decimals,
InfoStat will use exponential notation.

Field separator
This allows the user to select a type of separator (space, tab, comma or semicolon) as the
character that will separate the columns of a table; the default selector is a space. Usually
this separator does not need to be modified, but it can be useful to do so when results from a
table are exported.

Typography
This allows the user to change the typographical attributes (font style and font) used in
presenting results. This action can also be evoked by activating the A button on the
Toolbar.

Export results to table

This allows the user to export the text of a Results window as a table. Upon selecting this
action, a dialogue window called Text Importer will open. For details regarding operations
with this window, see OPEN TABLE in the DATA menu.

Access to results submenus through right clicking on the mouse

In addition to the actions presented in the RESULTS menu, the user can also access the
following options by right clicking on the mouse when a Results window is active:
Decimals: establishes the number of decimals that are shown in an active window.
Copy: copies the previously selected text, using tabs as field separators. The text can be
read directly in word processors for the construction of tables.
Delete: deletes the active result.
Delete present and previous windows: deletes the active result as well as all previous
results.
Print: prints the content of the active result.

Statistics
InfoStat conducts different statistical analyses using an active data table. The selection of
the type of analysis is done from the STATISTICS menu. Each time a procedure is invoked,
the output is presented in a results window which can be formatted and prepared to be
exported according to the specifications given by the user from the OUTPUT menu.
The actions (submenus) that are applied to the analysis of tables in InfoStat, in the
STATISTICS menu, are the following: Summary statistics, Frequency tables, Probabilities
and quantiles, Estimating population parameters, Sample size, One-sample inference,
Two-sample inference, Analysis of variance, Non-parametric ANOVA, Extended and mixed
linear models, Linear regression, Correlation analysis, Categorical data, Multivariate
analysis, Time series, and Fitting and smoothing.
In general, these actions initially invoke a window that is used to select variables. In it, the
user should indicate the variable(s) of interest and the desired partition, in the case that the
analysis is done by group or partition of the data file. In the variables selector, the user can
include the variables of interest by clicking on the arrows in the Variables subwindow. The
partitions to be generated should be declared in the Partitions tab; the Partition by command
allows the user to identify the variable(s) that will be used to partition the analysis. When
more than one variable is selected, the groups result from the combination of the levels of the
selected variables.
For example, consider the variables seed color (light, dark and red) and seed size (large,
medium and small). If only color is selected, three groups are generated (the three levels of
color); if, instead, both variables are selected, 9 groups will be generated. The partitions will
appear in a list to the right of the window; this list can be altered by selecting and
eliminating, with the displacement arrows found at the bottom of the list, one or more
groups that the user does not wish to include in the analysis. Once groups have been
identified, InfoStat will repeatedly conduct the requested analysis on the observations of
each group, separately.

Descriptive statistics
The first block in the Statistics menu allows the user to describe a group of data by means of
univariate summary statistics, frequency tables and theoretical distribution functions
adjusted to empirical distributions (sample frequency tables). All of these actions can be
conducted for the group of active tables, either as a whole, as a subgroup or partition of the
file, if the user indicates a partitioning variable in the Partitions tab. For summary statistics
and frequency tables, it is possible to work with files that have as many rows as
observations (see the Atriplex.idb file), or with files in which each row of the column of
interest represents a value of the variable and in which another column of the file contains
the frequency of each value (see the Insectos.idb file). In the first case, in the variables
selector, the variable(s) of interest should be indicated and the Frequencies field should be
left empty. In the second case, the column that contains the different values of the variable
should be indicated in the Variables window of the selector, and the column that contains
the frequencies should be indicated in the Frequencies (only option) window. InfoStat also
provides a probabilities and quantiles calculator for different types of random variables.

Summary statistics
The following summary statistics are available: number of observations (n), Mean, standard
deviation (S.D.) variance with denominator n-1 (Var(n-1)), variance with denominator n
(Var(n)), standard error (S.E.), coefficient of variation (CV), minimum value (Min),
maximum value (Max), Median, quantile 0.25 or first quartile (Q1), quantile 0.75 or third
quartile (Q3), sum of observations (Sum), Asymmetry, Kurtosis, uncorrected sum of
squares (USS), corrected sum of squares (CSS) ,median absolute deviation (MAD), Missing
data, percentiles 5, 10, 25, 50, 75, 90 and 95 (P(05), P(10), etc.).
The number of observations reported corresponds to the number of active cases. The sample
statistics are calculated using the number of cases that remain after observations with
missing data have been omitted. The code for missing data can be entered by the user. The
Mean statistic refers to the arithmetic mean. The Standard deviation refers to the square root
of the sample variance, calculated as the sum of the squares of the deviations with respect to
sample mean, divided by (n-1). The Standard error refers to the standard deviation divided
by the root of n. The Coefficient of variation is the quotient of the standard deviation and the
sample mean, expressed as a percentage.
The first quartile (Q1), the median and the third quartile (Q3), as well as any other
percentile, can be obtained by ordering the sample and selecting one of the observed values
according to its position, or can be estimated based on an approximation of the empirical
distribution function. If the user selects Based on EDF in the Percentiles subwindow,
InfoStat will first estimate this function and then use it to report the requested
percentile. If the Sample option is selected, the percentile will be one of the sample values
obtained after ordering the sample. For this reason, the two procedures will not necessarily
produce the same numeric result.
Results can be presented horizontally or vertically. A horizontal presentation is useful to
export results to a new data table prior to conducting further analysis using a data table that
contains summary statistics.
Summary statistics for one or more variables can be simultaneously solicited from the file
(indicated in the variables selector). These summary statistics can be obtained using all the
observations from the file, or for a subgroup of observations. The subgroups can be formed
from a single variable or from a combination of two or more variables from the file. To form
groups, the user should indicate the variables that define the groups by listing these in the
Class variables (optional) subwindow in the variables selector. Alternatively, the Partition
tab can be activated to indicate the variables that partition the file; however this option is
less efficient than using the Class variables in terms of execution time. For this reason, we
recommend using the class variables option when the user wishes to obtain summary
statistics for a large number of subgroups of an extensive file.
To illustrate, we use data from the Atriplex file. Selecting STATISTICS menu
SUMMARY STATISTICS, we activate the Descriptive statistics window in which the
desired variable(s) are selected. If a variable is selected in the Partition tab to create a
partition in the file, the requested summary statistics will be generated for each group or
partition. In this example, the variables Germination and Normal Seedlings were selected,
and in the Partition tab the variable Size was selected. The following summary statistics
were activated or requested: n, Mean, S.D., Var(n-1), Min, Max, Median and P(50) estimated
from the empirical distribution function (this statistic does not coincide exactly with the
Median, since the Median is calculated using the sample data, whereas P(50) is calculated
using the empirical distribution of the sample data; if, in requesting P(50), the Sample box
is left activated, the Median and P(50) will be the same). The Horizontal presentation was
selected. The results are shown in the following table:

Table 3: Summary statistics for variables in the Atriplex file, according to the partition by seed size
(horizontal presentation).

Summary statistics
Size    Variable          n   Mean    S.D.   Var(n-1)  Minimum  Maximum  Median  P(50)
small   Germination       9   54.56   26.34    694.03    20.00    93.00   60.00  48.67
small   Normal Seedlings  9   24.44   20.24    409.53     0.00    60.00   20.00  20.00
big     Germination       9   73.33   19.28    371.75    40.00    93.00   80.00  71.00
big     Normal Seedlings  9   51.33   22.12    489.50    27.00    87.00   47.00  42.33
medium  Germination       9   68.78   32.81   1076.19    13.00   100.00   87.00  80.00
medium  Normal Seedlings  9   50.67   27.44    752.75     7.00    80.00   54.00  40.50
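Outside InfoStat, the same kind of grouped summary can be sketched with pandas (the data below are a made-up subset, not the actual Atriplex file):

# Minimal sketch of summary statistics by group (partition by Size).
import pandas as pd

atriplex = pd.DataFrame({
    "Size": ["small"] * 3 + ["big"] * 3,
    "Germination": [20, 60, 93, 40, 80, 93],
})

stats = atriplex.groupby("Size")["Germination"].agg(
    n="count", Mean="mean", SD="std", Min="min", Max="max", Median="median"
)
print(stats)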

Frequency tables
STATISTICS menu FREQUENCY TABLES allows the user to obtain a frequency table
and/or test the fit of theoretical distributional models to an empirical distribution table.
Frequency tables can, according to the fields activated by the user, contain the following
information: lower limits (LL) and upper limits (UL) of the class intervals, mean of the
interval (MI), absolute frequencies (AF), relative frequencies (RF), cumulative absolute
frequencies (CAF) and cumulative relative frequencies (CRF). The number of classes can be
obtained automatically or can be defined by the user (PERSONALIZED). For the automatic
method, InfoStat obtains the number of classes by taking log2(n+1). For the personalized
case, InfoStat allows the user to specify the minimum, the maximum and the number of
intervals. The intervals are closed on the right. If the variable is categorical, personalization
is not accepted, and the frequency table shows as many classes as there are categories of the
variable. If the values of the variable were declared integers, by default InfoStat considers
the variable a count variable and shows the frequencies of all the integer values between the
minimum and the maximum. If the variable contains integer values and the Consider
integer variables as countings box is de-activated, InfoStat treats the variable as continuous
and uses the values to define class intervals and construct the table.

Again, using data from the Atriplex file, we obtained a frequency table for the germination
variable for each of the seed sizes, by invoking the following actions: STATISTICS
FREQUENCY TABLES; in the Frequency Distribution window, Variables tab,
germination was selected, and before clicking OK, the Partition criteria tab was activated
and the variable size was added (all the seed sizes present in the file are automatically
visualized). Upon clicking OK, the Distribution of frequencies - Frequency table options
window appears, from which the user can indicate the type of information he wishes to
visualize in the table and define the number of classes. In this example, all the default options
were accepted, and upon clicking OK, the number of classes was calculated automatically.
The results are shown in the following table:
Table 4: Frequency table for the germination variable from the Atriplex file, according to the
partition conducted by the variable seed size.

Frequency distribution
Size    Variable      Class    LL       UL      MI    AF   RF
big     Germination     1     40.00    57.67   48.83   3  0.33
big     Germination     2     57.67    75.33   66.50   0  0.00
big     Germination     3     75.33    93.00   84.17   6  0.67

Size    Variable      Class    LL       UL      MI    AF   RF
medium  Germination     1     13.00    42.00   27.50   3  0.33
medium  Germination     2     42.00    71.00   56.50   0  0.00
medium  Germination     3     71.00   100.00   85.50   6  0.67

Size    Variable      Class    LL       UL      MI    AF   RF
small   Germination     1     20.00    44.33   32.17   3  0.33
small   Germination     2     44.33    68.67   56.50   3  0.33
small   Germination     3     68.67    93.00   80.83   3  0.33

Fittings
STATISTICS menu FREQUENCY TABLES, Fittings tab, allows the user to obtain
goodness of fit tests. The null hypothesis specifies a theoretical distribution model for the
data. The values observed in the sample are compared to the expected values according to
the specified model, through the use of the Chi square statistic and/or the maximum
likelihood statistical significance, or G, test (Agresti, 1990). The user should select from
among one of these two statistics in order to conduct a goodness of fit test. Furthermore, he
should specify whether he wishes to estimate from the sample, or externally specify the
parameters of the theoretical distribution that, hypothetically, describe the data. If specify is
activated, as many check boxes as there are parameters in the selected theoretical
distribution will appear, so that the user may input information. The check boxes reserved
for each parameter of a distribution will automatically contain the values of the sample
estimators of each parameter. In the case of continuous variables, the empirical distribution
will be constructed from the automatically generated information on class intervals. These
intervals can be generated with the lower and upper ends open or closed, depending on how
the user specifies them in the Frequency distribution - Fittings window.
The following theoretical distributions can be automatically specified in the null
hypothesis: Normal, Chi square (Chi Sq.), Uniform, Binomial, Poisson and Negative
binomial (NegBin) (see Data Management Chapter). The option None (selected by default)
allows the user to visualize the empirical distribution.
Example 1: The data from the Insects file show the observed frequencies for number of
insects per plant in a plot with 200 plants. These values are used to test the hypothesis that
the distribution of the variable fits the negative binomial distribution.
In the Frequencies distribution table, Variable subwindow, enter insects, and in
Frequencies enter observed. In Fittings, select NegBin (negative binomial distribution).
This will generate a table containing the observed absolute frequencies (AF), the expected
absolute frequencies corresponding to the proposed distributional model (E(AF)), and the p
value of the goodness of fit test.
Table 5: Goodness of fit test (Pearson Chi square statistic) for the hypothesis that the observations
have a Negative Binomial distribution with parameters estimated from the sample. Insects file.

Frequency distribution
Fitting a Negative Binomial, k = 1.10921 and mean 1.20500 (estimated parameters)
Variable  Class  MI  AF   RF    E(AF)  E(RF)  Chi-square      p
Insect#     1     0  89  0.45   88.46   0.44    3.3E-03
Insect#     2     1  52  0.26   51.09   0.26    0.02
Insect#     3     2  24  0.12   28.06   0.14    0.61
Insect#     4     3  15  0.08   15.14   0.08    0.61
Insect#     5     4  10  0.05    8.10   0.04    1.05
Insect#     6     5   5  0.03    4.31   0.02    1.16
Insect#     7     6   4  0.02    2.28   0.01    2.45
Insect#     8     7   1  0.01    1.21   0.01    2.49
Insect#     9     8   0  0.00    1.35   0.01    3.83       0.6991

A p value less than the nominal significance level of the test leads to the rejection of the
proposed distributional model. In this example, we can say that the distribution of the insect
count can be modeled by a Negative binomial distribution with the parameters specified in
the heading of the table, since p>0.05. The parameters are automatically estimated from the
model under study.
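The Pearson chi-square computation behind Table 5 can be sketched as follows (SciPy, using the observed and expected frequencies from the table; the degrees of freedom are reduced by the two estimated parameters, k and the mean):

# Minimal sketch of the Pearson chi-square goodness-of-fit computation.
import numpy as np
from scipy.stats import chi2

observed = np.array([89, 52, 24, 15, 10, 5, 4, 1, 0])
expected = np.array([88.46, 51.09, 28.06, 15.14, 8.10, 4.31, 2.28, 1.21, 1.35])

chi_sq = ((observed - expected) ** 2 / expected).sum()
df = observed.size - 1 - 2            # classes - 1 - estimated parameters
p_value = chi2.sf(chi_sq, df)
print(round(chi_sq, 2), round(p_value, 4))   # about 3.83 and 0.6991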

Probabilities and quantiles


InfoStat provides a calculator that can be used to obtain the probability of values less than or
equal to a specified value (Cumulative Probabilities), for an ample selection of random
variables. Probabilities can be calculated for the following distributional models: Uniform
(a,b), Normal (m,v), Student t (v), Chi-square (v,lambda), Non-central F (u,v,lambda),
Exponential (lambda), Gamma (lambda,r), Beta (a,b), Weibull (a,b), Logistic (a,b),
Gumbel (a,b), Studentized range (k,v), Poisson (lambda), Binomial (n,p), Geometric (p),
Hypergeometric (m,k,n) and Negative binomial (m,k) (see Data Management Chapter).

For each model, the user should specify the value(s) of the parameters, whose notation can
be found in parentheses to the side of the name of the distribution.
InfoStat provides distributional quantiles for these models.
To obtain a Probability value, the user should first select the theoretical distribution for
which he wishes to calculate probabilities, and then he should enter the parameters which
characterize that distribution.
If the user wishes to know the cumulative probability up to a certain value (x) of that
distribution, he should activate the X value check box and enter the value of the random
variable for which he wishes to obtain the cumulative probability. By clicking Calculate or
pressing the Enter key, he will be able to read the probability of occurrence of values less
than or equal to x, according to the proposed distributional model, in the Prob. (X ≤ x) check
box. The complement of Prob. (X ≤ x) will appear in the Prob. (X > x) check box. The
probability that a discrete variable will assume values equal to x, according to the proposed
distributional model, will appear in the Prob. (X = x) check box (if a model for continuous
variables is selected, this value will always equal zero). If the user wishes to know quantile
p of the selected distribution, he should enter the p value in Prob. (X ≤ x) and click
Calculate. The p-th quantile of the proposed distributional model can be read in the X value
check box, where p ∈ [0,1].
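A minimal SciPy sketch (not InfoStat's calculator) of the same cumulative probability and quantile computations for a Normal (m, v) distribution; note that SciPy's scale argument is the standard deviation, i.e. the square root of v.

# Minimal sketch of cumulative probabilities and quantiles for Normal(m, v).
from math import sqrt
from scipy.stats import norm

m, v, x = 0.0, 1.0, 1.96
print(norm.cdf(x, loc=m, scale=sqrt(v)))      # Prob(X <= x), about 0.975
print(norm.sf(x, loc=m, scale=sqrt(v)))       # Prob(X > x)
print(norm.ppf(0.975, loc=m, scale=sqrt(v)))  # quantile: X value, about 1.96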

Estimators of population characteristics


This module allows the user to estimate population characteristics in sampling studies
designed with the following techniques: simple random sampling, stratified sampling, and
cluster sampling.

Definitions of terms associated with the sampling technique


A population (or universe) is a group of elements or entities that share some attribute and
whose temporal or spatial limits can be established. Populations can be finite or infinite,
depending on their size. Finite populations have objects that can be numbered. The element,
or unit element, is an object or individual of the population for which the measurement
under study is taken. A sample is a non-empty subgroup of the population, denoted by
{X1,X2,...,Xn}. Not every sample is adequate or pertinent to the objectives of a study; this
creates the need to design a sampling scheme and to obtain estimations according to the
technique used in the collection of information. The elements or groups of elements that are
the object of selection through a sampling process are known as sample units. The total
group of sample units in a population is defined as the sampling frame.
For example, one might wish to know the degree of Mediterranean fly infestation of the
fruits of a peach plantation. The population is the group of all the peaches on the plantation.
The unit element is the peach. It may be difficult to construct the sampling frame using the
individual peaches, but one could do so using each tree, in which case the sampling unit
would be each peach tree. The sampling frame would be the union of all the trees on the
plantation under study.
The parameters are constants that characterize a population; for example, the population
mean, the proportion of cases with a given attribute, the total for a given attribute, and the
population variance. The estimators are functions defined for the space of all possible
samples of a given size, and the purpose of their image is to provide information for the
value of the population parameters. Examples of estimators are the mean and the sample
variance.
InfoStat accepts two types of variables to generate estimations of population parameters.
The characteristics under study can be continuous or dichotomous. Dichotomous
characteristics allow the user to estimate population parameters related to the proportion of
successes or cases in a given class. If the user wishes to convert a continuous variable into a
dichotomous (or dummy) variable, InfoStat allows him to dichotomize variables by
comparing each of its values to a reference value. The point that allows the dichotomization
can be the mean of the characteristic, the median, or any value entered by the user. The user
can dichotomize by labeling as a success the values that are greater than, less than, greater
than or equal to, or less than or equal to the given reference value of the variable under
study.
Let {X1, X2, ..., XN} be the group of all the values in the population (population of size N).
Then, the total, mean and variance parameters are defined as follows:

Total: \tau = \sum_{i=1}^{N} X_i

Mean: \mu = \frac{1}{N} \sum_{i=1}^{N} X_i

Variance: \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2
In a population of size N, the number of possible samples of size n, drawn without
replacement, is C(N,n) (combinations of N elements taken n at a time). For example, if N=30
and n=2, C(30,2)=435. If a sample statistic is calculated from each sample, there would be 435
different sample estimates. The standard error of an estimator refers to the square root of
the variance of its sampling distribution. The coefficient of variation of a population
parameter estimator is defined as the quotient between its standard error and the true value
of the estimated parameter. The square of the coefficient of variation of an estimated parameter
is referred to as the relative variance of the parameter under study. The standard error of an
estimator is a measure of the sample variability of the estimator over all possible samples. Since
the distribution of the estimators is close to the normal distribution when the sample size is
large enough, it is possible to use normal theory to obtain approximate confidence intervals
for the parameters that are being estimated. The (1−α)100% confidence interval for the
parameter would be as follows:
\hat{\theta} \pm Z_{1-\alpha/2}\, EE(\hat{\theta})

where \hat{\theta} is an estimator for θ; Z_{1-\alpha/2} is the (1−α/2)100 percentile of the standard normal
distribution, and EE(\hat{\theta}) is the standard error of \hat{\theta}.

For each different type of sampling and estimator, InfoStat allows the user to obtain the
standard error, the coefficient of variation of the estimator, the relative variance and the
confidence interval for the estimated parameters, with the confidence coefficient specified by
the user.

Simple random sample


STATISTICS menu ESTIMATING POPULATION PARAMETERS RANDOM
SAMPLING allows the user to estimate population parameters within a simple random
sampling framework. The simple random sample (s.r.s.) is a sampling plan in which a sample
of size n is taken following a procedure such that every sample of size n (of a population with N
elements) has the same probability of being selected. The total possible number of samples
is T=C(N,n). The probability of selection of a sample mj of size n is:

P(mj) = 1/T, where j = 1, ..., T.

InfoStat assumes that the values of the columns of the data table correspond to sample
values of one or more characteristics under study. In the dialogue window of the variables
selector, the user should indicate which column(s) in the table contain these characteristics.
When various classification criteria exist for the population, but for theoretical or practical
reasons it is not convenient to conduct a stratified sample, the user can conduct estimations
within these subdomains by using a simple random sample. The population can be finite, in
which case the population size should be entered.
For convenience, we denote the sample elements from the first to the n-th with the following
notation: x1 ,..., xn . These then become the values of the X variable for the elements
numbered 1 to n. After having taken the sample, it is possible to calculate values such as:
totals, means, proportions, standard deviations, etc.
For a simple random sample, InfoStat estimates the total, the mean and the proportion of
successes (and total number of successes) in the following way:

\hat{t}_{mas} = \frac{N}{n} \sum_{i=1}^{n} x_i

\bar{X}_{mas} = \frac{1}{n} \sum_{i=1}^{n} x_i

\hat{p}_{mas} = \frac{1}{n} \sum_{i=1}^{n} I(x_i)

where I(x_i) is the indicator function that evaluates observation x_i and produces a 1 or a 0,
depending on whether the observation represents a success or a failure, respectively.
Confidence intervals for the population parameters, with a confidence level specified by the
user, may also be requested. By default, the constructed interval has a confidence level of 95%.
To construct such intervals, the standard errors of the corresponding estimators are used, which
are calculated as the square root of the following variances:

V(\hat{t}_{mas}) = N^2\, \frac{N - n}{N}\, \frac{S_X^2}{n}

V(\bar{X}_{mas}) = \frac{N - n}{N}\, \frac{S_X^2}{n}

V(\hat{p}_{mas}) = \frac{N - n}{N}\, \frac{p(1-p)}{n - 1}

where S_X^2 is the unbiased estimator of the population variance of the characteristic X under
study, under the assumption that the population is infinitely large, defined as follows:

S_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
The preceding estimators include a finite correction factor used when populations are finite.
If the population size is not specified, InfoStat assumes that the population is infinite and
does not use the finite correction factor. Furthermore, the user can solicit the coefficient of
variation and the relative variance associated with the obtained estimation.
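A minimal sketch (hypothetical sample and population size) of the simple random sampling estimators above, including the finite correction factor (N − n)/N:

# Minimal sketch of s.r.s. estimators of the mean and total with their S.E.
import numpy as np

N = 500                                              # population size (assumed)
x = np.array([12.0, 15.0, 9.0, 14.0, 11.0, 13.0])    # made-up sample of size n
n = x.size

mean_hat = x.mean()
total_hat = N / n * x.sum()
S2 = x.var(ddof=1)                                   # unbiased sample variance

fpc = (N - n) / N                                    # finite correction factor
se_mean = np.sqrt(fpc * S2 / n)
se_total = np.sqrt(N**2 * fpc * S2 / n)
ci_mean = (mean_hat - 1.96 * se_mean, mean_hat + 1.96 * se_mean)  # ~95% CI
print(mean_hat, total_hat, se_mean, se_total, ci_mean)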
Upon invoking this submenu in InfoStat, the Random sampling window appears, which
allows the user to select the desired variables and partitions. InfoStat's Partition criteria
can be used in this menu to obtain estimates for different partitions of the file, defined as a
function of one or more variables. In the case that subdomains exist, the user should
indicate which column in the data table identifies these. If a column of the file contains
absolute frequencies for each value of the characteristic under study and each column is
indicated in the Frequencies subwindow, InfoStat will use this information to weigh the
values of the characteristic by the frequency for any estimation that is solicited from then
on. Upon accepting, another window is activated which allows the user to Insert
population size. The Continuous characteristics option activates the Population
parameters to be estimated window from which the following choices can be activated:
Mean, Total, Proportion of success, and Success. When one of these last two options is
selected, the option Dichotomize by appears automatically, and Consider as success
values... has the following options: greater than the mean, greater than or equal to the
mean, less than the mean, less than or equal to the mean, greater than the median,
greater than or equal to the median, less than the median, less than or equal to the

46

Statistics

median, greater than, greater than or equal to, less than, less than or equal to a given
value entered by the user in the window available for this purpose.
The following options can be found on the bottom part of the main window: Standard error
of the estimator, Confidence interval for the population parameter, Coefficient of variation
of the estimator, and Relative variance.

Stratified sample
STATISTICS menu ESTIMATING POPULATION PARAMETERS STRATIFIED
SAMPLING allows the user to obtain estimations within the framework of a stratified
sample. In this kind of sampling frame, the population is divided into strata and a simple
random sample is taken from each stratum. Let the size of stratum h be denoted by Nh, and let
nh be the size of the sample obtained from that stratum (where h=1,...,L); the total possible
number of samples is given by:

T = \binom{N_1}{n_1} \cdots \binom{N_h}{n_h} \cdots \binom{N_L}{n_L}

where \sum_{h=1}^{L} n_h = n.
For example, if there are three strata denoted by E1, E2 and E3, of sizes 3, 5 and 4,
respectively, the total number of possible samples of sizes 2, 3 and 2 for those strata would
be 3, 10 and 6, respectively. An example showing all the possible samples for the strata
mentioned is presented below:
Population
Strata   X
1        10
1        11
1         9
2        12
2        13
2        11
2        14
2        13
3        17
3        19
3        18
3        20

Possible samples
E1               E2                   E3
M1: 10 11        M1:  12 13 11        M1: 17 19
M2: 10  9        M2:  12 13 14        M2: 17 18
M3: 11  9        M3:  12 13 13        M3: 17 20
                 M4:  12 11 14        M4: 19 18
                 M5:  12 11 13        M5: 19 20
                 M6:  12 14 13        M6: 18 20
                 M7:  13 11 14
                 M8:  13 11 13
                 M9:  13 14 13
                 M10: 11 14 13

The estimators by strata (indexed by h) for the total, mean and proportion of the population
are as follows:

$$\hat{t}_h = \frac{N_h}{n_h}\sum_{i=1}^{n_h} x_{ih}, \qquad
\bar{X}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} x_{ih}, \qquad
\hat{p}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} I(x_{ih}),$$

where xih is the i-th value of the observed variable in stratum h and I(xih) is an indicator
function that, evaluated at observation xih, returns a 1 or a 0, depending on whether the
observation represents a success or a failure, respectively.
These estimators have the same form as the estimators for a simple random sample within
each stratum. Therefore, the variance of the stratified estimators (denoted by the subscript me)
for the L strata is constructed from the variances of the estimators by stratum.

$$\hat{V}(\hat{t}_{me}) = \sum_{h=1}^{L} N_h^2\,\frac{S_h^2}{n_h}\,\frac{N_h-n_h}{N_h}$$

$$\hat{V}(\bar{X}_{me}) = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{n_h}\,\frac{N_h-n_h}{N_h}$$

$$\hat{V}(\hat{p}_{me}) = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \frac{p_h(1-p_h)}{n_h-1}\,\frac{N_h-n_h}{N_h}$$

where $S_h^2$ is the sample variance of the random variable in stratum h.


In some cases, the sample units cannot be classified a priori as belonging to a given stratum.
If that information is obtained during the sampling process, then a post-stratification
sampling is conducted. This technique is based on the application of estimators for different
strata on a simple random sample, after the sampling units have been classified into
different strata. The difference between this technique and the estimation for subgroups,
within the framework of simple random sampling, is that in this case, strata size is known.
The variance of the estimators is corrected to account for the fact that sample sizes by
stratum are random.
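To illustrate the stratified estimators above, the following Python sketch (not part of InfoStat; names are illustrative) computes the stratified estimate of the population mean and its standard error. The data used are the first possible sample of each stratum from the example above, with stratum sizes 3, 5 and 4.

import math

def stratified_mean(strata):
    # Stratified estimate of the population mean and its standard error.
    # strata: list of (N_h, sample_h) pairs, one per stratum.
    N = sum(N_h for N_h, _ in strata)                    # total population size
    estimate, variance = 0.0, 0.0
    for N_h, sample in strata:
        n_h = len(sample)
        mean_h = sum(sample) / n_h
        s2_h = sum((x - mean_h) ** 2 for x in sample) / (n_h - 1)
        estimate += (N_h / N) * mean_h                   # weighted stratum means
        variance += (N_h / N) ** 2 * (s2_h / n_h) * (N_h - n_h) / N_h
    return estimate, math.sqrt(variance)

# Strata E1, E2 and E3 of the example above, using the first possible sample of each.
strata = [(3, [10, 11]), (5, [12, 13, 11]), (4, [17, 19])]
print(stratified_mean(strata))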
Upon invoking the STRATIFIED SAMPLING submenu in InfoStat, the Stratified
sampling window appears; this allows the user to select the desired variables. InfoStat's
partition criterion can be used in this menu to obtain estimates for different partitions
of the file, defined as a function of one or more variables. In this case, the user must specify
the column identifying the stratum. If there is a column containing absolute frequencies for each value of the trait
under study, and that column is selected in the Frequencies subwindow, InfoStat will use this
information to weigh the values of the trait by their respective frequencies for any
subsequently requested estimation. Upon accepting, another window is activated that

contains a list of the names of the strata, in which the user should enter each Stratum size.
If working with a post-stratified sample, the user should activate the corresponding field.
The Continuous characteristics option activates the Population parameters to be
estimated submenu, from which the following options can be activated: Mean, Total,
Proportion of successes, and Total successes. Upon activation of either of the last two options,
the user is asked to Dichotomize by, and Consider as success values has the following
options: greater than the mean, greater than or equal to the mean, less than the mean,
less than or equal to the mean, greater than the median, greater than or equal to the
median, less than the median, less than or equal to the median, as well as greater than,
greater than or equal to, less than, less than or equal to a value specified by the user in
the window designated for this purpose.
The following options can be found at the bottom of the window: Standard error of the
estimator, Confidence interval, Coefficient of variation of the estimator and Relative
variance.

Cluster sampling
STATISTICS menu ESTIMATION OF POPULATION PARAMETERS CLUSTER
SAMPLING allows the user to estimate parameters within the framework of cluster
sampling. This type of sampling is used when it is not possible, or it is impractical,
to use a sampling frame based on the elementary sample units, but it is possible
to use a sampling frame based on groups (clusters) of sample units.
For example, if the user wishes to estimate the degree of damage caused by the
Mediterranean fly to peach plants, and the field has a total of 20 plants, each plant could be
considered a cluster. Of these clusters, m are randomly selected; then, for each of the selected
plants, the number of healthy and unhealthy fruits is counted on each of the main branches.
Various sampling strategies are grouped under the category of cluster sampling, but each
of them generates different estimators and errors. InfoStat obtains estimates according to
simple one-stage cluster sampling, which is characterized by the selection of a group of m
clusters, according to a simple random sampling plan, which are subsequently analyzed.
The notation used for this sampling framework is as follows:

M = number of clusters in the population
m = number of clusters sampled
nc = number of units per cluster
N = population size
N̄ = average cluster size

The estimators for this sampling strategy for continuous characteristics are as follows:

$$\hat{t} = \frac{M}{m}\sum_{j=1}^{m}\sum_{i=1}^{n_c} x_{ij} = \text{population total}$$

$$\bar{X} = \frac{M}{Nm}\sum_{j=1}^{m}\sum_{i=1}^{n_c} x_{ij} = \text{population mean}$$

$$\hat{t}_c = \frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{n_c} x_{ij} = \text{cluster total}$$

$$\bar{X}_c = \frac{1}{m\bar{N}}\sum_{j=1}^{m}\sum_{i=1}^{n_c} x_{ij} = \text{cluster mean}$$

For binary characteristics, InfoStat allows users to estimate the proportion of successes and
the total number of successes. When the variable is continuous, the proportion of successes
and the total number of successes can be calculated after a prior dichotomization of the
continuous variable.
The variances of the estimators are obtained in the following manner:

$$\hat{V}(\hat{t}) = M^2\,\frac{M-m}{M}\,\frac{1}{m}\,
\frac{\sum_{j=1}^{m}\left(\sum_{i=1}^{n_c} x_{ij} - \hat{t}_c\right)^2}{m-1},
\qquad \hat{V}(\bar{X}) = \frac{\hat{V}(\hat{t})}{N^2}$$

$$\hat{V}(\hat{t}_c) = \frac{M-m}{M}\,\frac{1}{m}\,
\frac{\sum_{j=1}^{m}\left(\sum_{i=1}^{n_c} x_{ij} - \hat{t}_c\right)^2}{m-1},
\qquad \hat{V}(\bar{X}_c) = \frac{\hat{V}(\hat{t}_c)}{\bar{N}^2}$$
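As an illustration, the Python sketch below applies the standard one-stage cluster formulas used in the reconstruction above to estimate a population total and its standard error. The function name and the example data (20 plants as clusters, 4 of them sampled) are hypothetical, and InfoStat's output may differ in detail from this sketch.

import math

def cluster_total(sampled_clusters, M):
    # One-stage cluster sampling: estimate of the population total and its
    # standard error, based on the totals of the m sampled clusters.
    m = len(sampled_clusters)
    totals = [sum(c) for c in sampled_clusters]          # t_j: total of each sampled cluster
    t_bar = sum(totals) / m                              # mean cluster total
    s2_t = sum((t - t_bar) ** 2 for t in totals) / (m - 1)
    t_hat = M * t_bar                                    # estimated population total
    v_hat = M ** 2 * ((M - m) / M) * s2_t / m            # finite correction over clusters
    return t_hat, math.sqrt(v_hat)

# Hypothetical data: 20 plants (clusters) in the field, 4 of them sampled,
# with the fruit counts per branch of each sampled plant.
print(cluster_total([[8, 5, 7], [6, 6, 9], [10, 4, 8], [7, 7, 5]], M=20))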

Upon invoking this submenu in InfoStat, the window entitled Cluster sampling appears,
which can be used to select the desired variables and partitions. InfoStat's partition criteria
can be activated from this menu to obtain estimates for different partitions of the file,
defined as a function of one or more variables. In this case, the user must indicate the clusters
(by indicating the column of the data table that identifies them). If there is a column in
the file that contains absolute frequencies for each value of the characteristic under study,
and said column is indicated in the Frequencies subwindow, InfoStat will use this
information to weigh the values of the characteristic by their frequencies for any subsequently
requested estimation.
Upon accepting, another window appears where the user should enter the Number of clusters
in the population (M) and the Average cluster size (N̄). The Population parameters to
estimate submenu allows the user to activate the following options: Average, Total,
Proportion of successes, Total successes. When the user wishes to dichotomize a
continuous variable, he should select Dichotomize by and, in the Success values option,
can select any of the following: greater than the mean, greater than or equal to the
mean, less than the mean, less than or equal to the mean, greater than the median,
greater than or equal to the median, less than the median, less than or equal to the
median, as well as greater than, greater than or equal to, less than, less than or equal to
a value specified by the user in the window designated for this purpose.
The following options can be found at the bottom of the window: Standard error of the
estimator, Confidence interval, Coefficient of variation of the estimator and Relative
variance.

Sample size calculation


Menu STATISTICS SAMPLE SIZE allows calculation of the sample size needed to
estimate a population mean or proportion with a confidence and precision determined by the
user. It also allows calculation of the sample size needed to detect, in the context of a one-way
fixed-effects ANOVA, a difference between group or population means as
small as that specified by the user, as well as the sample size needed to estimate the difference
between two population proportions. On entering the Sample size submenu, the options are: Estimating a mean
with a given precision, To detect a given difference between means, To estimate a
proportion, and To estimate a difference of proportions.

Estimating a mean with a given precision


This method presupposes an s.r.s. (simple random sample) and its objective is to give an
approximation, based on the normal distribution, of the sample size needed to estimate the
mean with specified confidence and precision. The approximation used for the sample size
calculation in InfoStat is:

$$n \geq \left(\frac{2\,Z_{1-\alpha/2}\,\sigma}{c}\right)^2$$

where σ is the population standard deviation (its value, or an upper bound for it, must be
entered), and c is the required width of the confidence interval with (1-α)100% confidence
for the population mean. The value of c can be chosen arbitrarily, or it can be expressed as a
fraction f of the sample mean (c = x̄f).
Alternatively, the user can specify the maximum acceptable standard error for the estimate,
as a criterion for sample size calculation.
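A minimal Python sketch of this calculation is shown below; it is not part of InfoStat, and the function name and the example values (sigma = 10, interval width c = 4) are illustrative only.

import math
from scipy.stats import norm

def n_for_mean(sigma, c, conf=0.95):
    # Sample size so that the confidence interval for the mean has total width c,
    # given the population standard deviation sigma (or an upper bound for it).
    z = norm.ppf(1 - (1 - conf) / 2)                     # Z_{1-alpha/2}
    return math.ceil((2 * z * sigma / c) ** 2)

print(n_for_mean(sigma=10, c=4, conf=0.95))              # interval of total width 4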

To detect a given difference between means


For a balanced design with a treatments or populations under study (fixed-effects model),
InfoStat provides the sample sizes associated with the power values, specified by the user,
for the test of null treatment effects. The sample sizes per treatment are derived from the
relationship between

$$\phi^2 = \frac{n\sum_{i=1}^{a}\tau_i^2}{a\,\sigma^2}$$

and the power, given by P(F0 > F(α; a-1, N-a) | H0 is false), where
τi is the effect of the i-th treatment, σ² the common within-treatment variance, a the
number of treatments, α the significance level of the test of null treatment effects, N the total
number of observations and F0 the ANOVA statistic.
So that the user does not have to specify the set of τi, i=1,...,a, φ² is calculated from the
expression

$$\phi^2 = \frac{n D^2}{2 a \sigma^2}$$

where D is the minimum difference between two means that is to be detected.
If at least two of the treatment means differ by D, this expression gives the smallest φ²
compatible with that difference, so the φ value, and consequently the sample size obtained,
are conservative; this provides a power at least equal to that specified by the user.
In the Criteria to obtain sample size subwindow, two alternatives can be specified:
Range of confidence interval or Standard error of estimation <=. As the
options for these two alternatives are changed, a field at the bottom allows entering the
Upper limit for the variance and thus obtaining the Required sample size.
Detect a LSD (least significant difference) allows the Power to be calculated for a fixed-effects
analysis of variance model as the following options are changed: Treatment number,
Pooled within variance, Significance level, Minimum difference to be detected, Rep. by
treatment (n).
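The sketch below illustrates the power calculation behind these options using the relationship described above, where the noncentrality parameter of the F distribution equals a·φ². It is not InfoStat's code; the function name and example values (a = 4 treatments, n = 10 replicates, σ² = 25, D = 8) are illustrative assumptions.

from scipy.stats import f, ncf

def anova_power(a, n, sigma2, D, alpha=0.05):
    # Power of the one-way fixed-effects ANOVA F test when at least two
    # treatment means differ by D, with n replicates per treatment.
    N = a * n
    phi2 = n * D ** 2 / (2 * a * sigma2)                 # phi^2 as defined in the text
    lam = a * phi2                                       # noncentrality parameter
    f_crit = f.ppf(1 - alpha, a - 1, N - a)              # critical value under H0
    return 1 - ncf.cdf(f_crit, a - 1, N - a, lam)        # P(F0 > F_crit | H0 false)

print(anova_power(a=4, n=10, sigma2=25, D=8))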

To estimate a proportion
This method presupposes an s.r.s. (simple random sample) and its objective is to give an
approximation, based on the normal distribution, of the sample size needed to estimate a
proportion with a specified confidence and precision. The approximation used to calculate
the sample size in InfoStat is:

$$n \geq \left(\frac{2\,Z_{1-\alpha/2}}{c}\right)^2 p(1-p)$$

where p is the a priori assumed population proportion (its value is entered with a sliding bar
ranging between 0 and 1), and c is the required width of the confidence interval, expressed as
a percentage of p, with (1-α)100% confidence for the true proportion.

To estimate a difference between two proportions


In the context of simple random sampling, when we wish to estimate a difference between
two proportions from samples of equal size, InfoStat provides the sample size to be drawn from
each population and the power values associated with the test of the hypothesis of no
difference between the proportions, based on the normal distribution (see estimation of a
difference between proportions).

Inference in one and two populations


InfoStat allows the user to test hypotheses and obtain confidence intervals for the parameters
of statistical models involving one or two populations. In the menus of this module, the user
indicates whether the inference is based on one or two random samples. The submenus
available in the one-sample case are: One-Sample t-Test, Runs test, Confidence intervals,
Normality test (modified Shapiro-Wilks), Goodness of fit test (Kolmogorov) and
Goodness of fit test (multinomial). In the two-sample case: T-test, Wilcoxon
(Mann-Whitney U), Wald-Wolfowitz test, Van der Waerden test, Bell-Doksum test,
Kolmogorov-Smirnov, Irwin-Fisher, Median test, Differences in proportions, Paired
T-test, Wilcoxon test, Sign test and F-test for two variances.
If the analysis is applied to more than one response variable, the results are reported
for each variable separately.

Inference based on one sample


One-Sample t-Test
Menu STATISTICS ONE-SAMPLE INFERENCE ONE-SAMPLE T-TEST allows
testing a hypothesis about the expectation of a random variable, of the form H0: μ = μ0. The test uses an
estimate of the variance of the response variable.
InfoStat provides the p value for a two-tailed test, p(Bilateral), or the p value for a right
one-tailed test, p(Unilateral D), or a left one-tailed test, p(Unilateral I), as specified. When the p value is smaller than the
nominal significance level (selected for the test), the statistic belongs to the rejection region, i.e.
the test suggests the rejection of the null hypothesis.

The test statistic is:

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

which under H0 has a Student's t distribution with n-1 degrees of freedom.
Activating the ONE-SAMPLE T-TEST submenu in InfoStat opens a window with the same
name that permits choosing the variable under study and, if desired, the variables that define
partitions. The next window allows requesting the information to be shown and selecting the kind of
test to perform: Two tails, One tail (right) or One tail (left). By default, InfoStat shows the following
information: n (sample size), Mean, SD (standard deviation), T (statistic value), p (p
value) and the Confidence interval (by default the confidence is 95%, but another value can be
chosen by activating the appropriate field). The Parameters field permits entering from
the keyboard the hypothesized value of the population mean, i.e. μ0.
Continuing with the Atriplex data file, the results of the test about the mean germination
percentage are presented. Suppose that you wish to test the hypothesis H0: μ = 50. Entering the
value 50 in the parameter field and accepting the enabled options, the following results were
obtained (the analysis was performed twice, once using a partition of the file by seed size, and
once without partition).
As shown, the germination percentage is significantly different from 50% only for large seeds.
The sample mean suggests that larger seeds have a germination percentage above
50%. Working with all the data, without partitioning by size, the null hypothesis is also
rejected.
Table 6: Results for One-sample t-Test by seed size using Germination. File Atriplex.
One-Sample t-Test
Mean value under the null hypothesis: 50

Size    Variable      n   Mean   SD     LL(95)  UL(95)  T     p(Two tails)
Large   Germination   9   73.33  19.28  58.51   88.15   3.63  0.0067
Medium  Germination   9   68.78  32.81  43.56   93.99   1.72  0.1243
Small   Germination   9   54.56  26.34  34.31   74.81   0.52  0.6180

Table 7: Results for One-sample t-Test for Germination. File Atriplex.
One-Sample t-Test
Mean value under the null hypothesis: 50

Variable      n    Mean   SD     LL(95)  UL(95)  T     p(Two tails)
Germination   27   65.56  26.93  54.90   76.21   3.00  0.0059
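The following Python sketch shows how a one-sample t test against μ0 = 50 can be reproduced outside InfoStat. The germination values used here are hypothetical, not the actual Atriplex data, so the numbers will differ from Tables 6 and 7.

import numpy as np
from scipy import stats

# Hypothetical germination percentages for one seed size (not the actual Atriplex data).
germ = np.array([62, 85, 91, 55, 73, 88, 60, 79, 67])

t, p = stats.ttest_1samp(germ, popmean=50)               # H0: mu = 50, two-tailed p
se = germ.std(ddof=1) / np.sqrt(len(germ))               # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(germ) - 1, loc=germ.mean(), scale=se)
print(f"T = {t:.2f}, p(Two tails) = {p:.4f}, 95% CI = [{lo:.2f}, {hi:.2f}]")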

Runs test
Menu STATISTICS ONE-SAMPLE INFERENCE RUNS TEST allows testing the
hypothesis of random order against the alternative of a trend (non-random order), using runs.
A run is a sequence of one or more elements that is preceded and/or followed by elements
different from those that make up the run. For a dichotomous variable, a run is identified
whenever there is a sequence of values of the variable that belong to the same category. For
example, in the series 1 0 0 0 1 1 0 0 1 1 there are three runs of 1s (of lengths 1, 2 and 2) and
two runs of 0s (of lengths 3 and 2).
As another example, suppose that daily measurements of an economic indicator are taken. A
run is identified whenever there is a group of consecutive measurements in which each daily
value is greater than that of the previous day. In that case the variable is not dichotomous; the
user can indicate a value, such as the median, to construct a new dichotomous series by
comparing each original observation with this value.
The R statistic is based on the number of runs; in the 0-1 series above, R = 5. When the sample
sizes tend to infinity, Wald and Wolfowitz showed that the standardized R statistic tends to the
standard normal distribution (Lehmann, 1975), so a normal approximation can be used to
calculate p values.
InfoStat allows performing this test by activating the RUNS TEST submenu. This opens a
window with the same name that permits choosing the variable under study and defining
partitions. Upon accepting, another window appears where you can choose: Runs above and
below the Mean, Runs above and below the Median (default) and Specified threshold value
(which enables a field to enter a value). The other information available is: n1+n2, n1, n2, runs,
E(R), Stand. Z and p (2 tails), where n1 and n2 are the numbers of observations belonging to
classes 1 and 2 of the dichotomous variable under study; runs corresponds to the test statistic;
R is the number of runs; E(R) is the expectation of the R statistic, defined as:

$$E(R) = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

Stand. Z is the standardized statistic value:

$$Z = \frac{R - E(R)}{S}, \qquad \text{with } S = \sqrt{\frac{n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1+n_2)^2 (n_1+n_2-1)}}$$

The p value for the test of the null hypothesis is obtained by activating p(2 tails); the
dichotomization criterion can be Runs above and below the Mean, Runs above and below the
Median, or the Specified threshold value entered by the user.


When n1 and n2 values are below 30, InfoStat obtains the exact p values from the R statistic
distribution. If the n1 and n2 values are greater than 30 the p value is obtained from Stand. Z
statistic.
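As an illustration of the normal approximation described above, the following Python sketch computes R, E(R), the standardized Z and a two-tailed p value for a numeric series dichotomized by a threshold. The function name, the example series and the threshold are hypothetical.

import math
from itertools import groupby
from scipy.stats import norm

def runs_test(series, threshold):
    # Runs test for randomness: dichotomize by 'threshold' and use the normal
    # approximation for the number of runs R.
    x = [1 if v > threshold else 0 for v in series]
    n1, n2 = sum(x), len(x) - sum(x)                      # sizes of the two classes
    R = len(list(groupby(x)))                             # observed number of runs
    e_r = 2 * n1 * n2 / (n1 + n2) + 1                     # E(R)
    s = math.sqrt(n1 * n2 * (2 * n1 * n2 - n1 - n2) /
                  ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (R - e_r) / s                                     # Stand. Z
    return R, e_r, z, 2 * (1 - norm.cdf(abs(z)))          # two-tailed p value

print(runs_test([3, 7, 2, 8, 9, 1, 4, 6, 5, 7], threshold=5))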

Confidence intervals
Menu STATISTICS ONE-SAMPLE INFERENCE CONFIDENCE INTERVALS allows
obtaining parametric confidence intervals, with confidence coefficients specified by
the user, for the parameters Mean, Median, Variance and Proportion. These intervals,
together with the confidence interval for a Percentile of the distribution, can also be obtained
non-parametrically using the bootstrap resampling technique (Efron & Tibshirani, 1993).
A confidence interval with level α is defined as the set of parameter values (interval) that, with
(1-α)100% confidence, would include the parameter value of the population, given the
variability in the sample and the shape of the sampling distribution of the estimator.
Parametric confidence intervals are built from assumptions about the shape of the sampling
distribution of the estimator (Normal, Student's t, Chi-squared, etc.).
The α/2 and (1-α/2) quantiles of the sampling distribution of the statistic used to build the
interval are selected to obtain the lower and upper limits of an interval of level α around
the parameter. Intervals built by this process may, by chance, fail to include the true value of
the parameter (type I risk), but that event is expected to happen only in α100% of the intervals
obtained.
Consider the example of the construction of a confidence interval, with level 0.05, around the
population mean μ. By the central limit theorem it is known that the sample mean, X̄, is
approximately normally distributed around μ with standard error σ/√n for large sample sizes n.
The standard normal distribution (when σ is known) or Student's t (when σ is estimated by S
calculated from the sample data) provides the probability of randomly drawing a sample
mean that lies a given number of standard deviations away from μ. For example, the
chances are 1 in 20 of drawing a mean that is at least 1.645 standard deviations larger than the
population mean, if the distribution of the statistic is normal. Using this idea, the confidence
interval for the population mean is built from the sampling distribution of X̄ as follows:

$$P\!\left(\bar{X} - Z_{1-\alpha/2}\sqrt{\sigma^2/n} \;\leq\; \mu \;\leq\; \bar{X} + Z_{1-\alpha/2}\sqrt{\sigma^2/n}\right) = 0.95$$

Then, the confidence interval limits for the mean with level α = 0.05 are:

$$LL = \bar{X} - 1.96\sqrt{\sigma^2/n} \quad \text{and} \quad UL = \bar{X} + 1.96\sqrt{\sigma^2/n}$$


In practice, the variance is estimated from the sample, so the statistic to be used must be

$$T = \frac{\bar{X} - \mu_0}{\sqrt{S^2/n}} \quad \text{rather than} \quad Z = \frac{\bar{X} - \mu_0}{\sqrt{\sigma^2/n}}.$$

The confidence interval limits for the mean reported by InfoStat are therefore calculated as:

$$LL = \bar{X} - T_{1-\alpha/2}\sqrt{S^2/n} \quad \text{and} \quad UL = \bar{X} + T_{1-\alpha/2}\sqrt{S^2/n}$$
It is possible that we are not sure that the conditions guaranteeing the distribution of our
statistic are met, and therefore we do not know the sampling distribution of the statistic that
we are using to build the confidence interval. For these situations, InfoStat allows selecting a
non-parametric interval construction technique based on a resampling process known as the
bootstrap. The bootstrap technique involves drawing at random, with replacement, B samples
of size n from the original sample of size n. For each of the B bootstrap samples (by default
B = 250), InfoStat calculates the statistic of interest (in the previous example, the sample mean)
and, after sorting the B estimates in ascending order, identifies the quantiles that will be used
as the limits of the bootstrap confidence interval for the parameter of interest. Thus, when
Bootstrap estimation is selected, the limits of the two-tailed interval with (1-α)100%
confidence correspond to the (α/2)100 and (1-α/2)100 percentiles of the list of estimates
obtained from the B bootstrap samples drawn from the original sample.
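A minimal Python sketch of the percentile bootstrap just described is shown below. It is not InfoStat's implementation; the function name and the example weights are hypothetical, and results will vary from run to run because of the random resampling.

import random

def bootstrap_ci(sample, stat=lambda s: sum(s) / len(s), B=250, conf=0.95):
    # Percentile bootstrap interval for a statistic (by default, the sample mean):
    # B resamples of size n are drawn with replacement from the original sample.
    n = len(sample)
    estimates = sorted(stat(random.choices(sample, k=n)) for _ in range(B))
    alpha = 1 - conf
    lower = estimates[int(B * alpha / 2)]                # (alpha/2)100 percentile
    upper = estimates[int(B * (1 - alpha / 2)) - 1]      # (1-alpha/2)100 percentile
    return lower, upper

weights = [68, 70, 65, 72, 69, 66, 71, 67, 70, 68]       # hypothetical data
print(bootstrap_ci(weights))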
If you select Parametric estimation, the intervals are built under normal theory for the
parameters Mean, Median and Variance. The confidence limits are calculated from the
following expressions:

$$\text{Mean:}\; \bar{X} \mp T_{1-\alpha/2}\, S/\sqrt{n} \qquad\qquad
\text{Median:}\; Me \mp T_{1-\alpha/2}\, S/\sqrt{n}$$

$$\text{Variance:}\; LL = \frac{S^2 (n-1)}{\chi^2_{1-\alpha/2}}\,; \quad UL = \frac{S^2 (n-1)}{\chi^2_{\alpha/2}}$$

where $T_{1-\alpha/2}$ and $\chi^2_{1-\alpha/2}$ are the (1-α/2) quantiles of the Student's t and
Chi-squared distributions, respectively.
For the construction of confidence intervals for a proportion (success rate), InfoStat directly
uses the quantiles of the Binomial(n, P) distribution associated with the number-of-successes
statistic, with n the number of trials and P the population success rate; these confidence
intervals are therefore exact. Several statistics texts present intervals based on the asymptotic
normal distribution of the sample proportion; these are unnecessary when the exact intervals
can be obtained, and are therefore not offered by InfoStat.


When an interval for P (success rate) is wanted and there is no binary variable but there is a
quantitative one, InfoStat allows building dichotomous variables through the user's definition
of the criterion used to decide whether a value of the quantitative variable is to be considered
a success (1) or a failure (0). When the confidence interval for the proportion is selected, a
subwindow is enabled: Success: values >, >=, <, <= or = a value specified by the user in the
field provided for this purpose.
It is possible to choose between two-tailed intervals or one-tailed intervals, both right and left.
In the field labeled Confidence, the user must enter the (1-α)100 value, where α is the risk of
making a type I error.
Example 3: Weights are available for 20 individuals of a class, and the average weight in the
population of individuals of this class must be estimated with a confidence interval of level
0.05. The Weights file contains the 20 values recorded in the sample.
Activating Menu STATISTICS ONE-SAMPLE INFERENCE CONFIDENCE
INTERVALS and selecting the variable Weights as the analysis variable, the Confidence
intervals options window appears upon accepting. Selecting Mean, Confidence 95%,
Parametric estimation and Bilateral, the following result is obtained:
Table 8: Parametric estimation of Confidence intervals. File Pesos.IDB2.
Confidence intervals
Two tails
Parametric estimation

Variable  Parameter  Estimate  S.E.  n   LL(95%)  UL(95%)
Weights   Mean       68.50     0.80  20  66.82    70.18

The quantile used in the construction of the interval is T(1-α/2) from the Student's t distribution
with (20-1) = 19 degrees of freedom (d.f.), that is, T(1-α/2) = 2.09, because the confidence
established was (1-0.05)100 = 95% with a sample size n = 20. We can conclude with 95%
confidence that the interval [66.82; 70.18] includes the value of the average weight in the
population from which the sample was drawn.
For the same example, if a bootstrap confidence interval is requested by selecting Bootstrap
estimation, a very similar result is obtained. This suggests that the technique is useful when the
sampling distribution of the statistic used is not known.
Table 9: Bootstrap estimation of Confidence intervals. File Pesos.IDB2.
Confidence intervals
Two tails
Bootstrap estimation (B=250)

Variable  Parameter  Estimate  S.E.  n   LL(95%)  UL(95%)
Weights   Mean       68.52     0.80  20  67.00    70.09


Normality Test (modified Shapiro-Wilks)


InfoStat allows testing whether the variable under study has a normal distribution. The
hypotheses are: H0: the observations have a normal distribution vs. H1: the observations do not
have a normal distribution.
To perform this test, select Menu STATISTICS ONE-SAMPLE INFERENCE
NORMALITY TEST (MODIFIED SHAPIRO-WILKS). The test uses the Shapiro-Wilks
statistic as modified by Mahibbur and Govindarajulu (1997).
The results of the modified Shapiro-Wilks test for the germination data from the Atriplex file
are presented below:
Table 11: Results of the Shapiro-Wilks normality test. File Atriplex.
Shapiro-Wilks (modified)

Variable  n   Mean   S.D.   W*    p(one tail)
PG        27  65.56  26.93  0.86  0.0020

In this case, there is evidence to reject the assumption of a normal distribution (p<0.05).
Goodness of fit test (Kolmogorov)
Menu STATISTICS ONE-SAMPLE INFERENCE GOODNESS OF FIT TEST
(KOLMOGOROV) allows testing whether the available sample follows a theoretical
distribution model. It is assumed that you have a random sample and want to test whether its
empirical distribution fits one of the following distributions: Normal (mean, variance),
T-Student (v), F-Snedecor (u, v, l), Chi-square (v), Gamma (lambda, r), Beta (a, b), Weibull (a, b),
Exponential (lambda) or Gumbel (a, b) (see Chapter Data Management). The theoretical
distribution must be completely specified (known parameters). The hypotheses are:

H0: G(x) = Ftheoretical(x) vs. H1: G(x) ≠ Ftheoretical(x), for at least one x

where G(x) is the empirical distribution function (of the observed values) and Ftheoretical(x) is
the theoretical distribution function specified by the user. The test is sensitive to any
discrepancy between the distributions (dispersion, position, symmetry, etc.). The statistic is
based on the maximum difference between both distributions, defined as the maximum
vertical difference between G(x) and Ftheoretical(x). The Kolmogorov D statistic is:

$$D = \sup_{x} \left\{\, \left| F_{theoretical}(x) - G(x) \right| \,\right\}$$

InfoStat provides p values for two-tailed tests obtained from the distribution of the D statistic,
which correspond to an asymptotic approximation (Hollander and Wolfe, 1999).
The Kolmogorov goodness of fit test is used when the hypothesized distribution function is
completely specified, i.e. there is no need to estimate any unknown parameter from the
sample. Therefore, on selecting a distribution, InfoStat enables as many fields as there are
parameters that characterize it, so that the user can enter the value of each parameter. When
sample estimates are entered as parameter values, the test can be too conservative. The
Chi-square test, available in the Frequency table-fittings menu, is less affected by the
incorporation of estimates from the sample; however, in some cases the power of the
Chi-square goodness of fit test is very low (Conover, 1999).
In the Frequency distribution window, the variable and the theoretical distribution selected
for the fit are specified, together with the maximum-likelihood estimates of the parameters
obtained from the observations in the active data table. The value of the statistic (D) and the
corresponding p value are also reported, obtained from the asymptotic approximation of the
statistic's distribution.
Example 4: In a study on fattening steers, 150 records of weight (kg) were obtained. We
want to know whether the sample fits a normal distribution model with mean 400 and
variance 400. The Steers file contains the observations. The results of the Kolmogorov
goodness of fit test are presented in the following table:
Table 12: Results of the Kolmogorov goodness of fit test. File Steers.
Goodness of fit test (Kolmogorov)

Variable  Fit              Mean    Variance  n    D Statistic  p-value
Weight    Normal(400,400)  399.57  478.23    150  0.08         0.3201

P values smaller than the significance level suggest the rejection of H0. Since p = 0.3201, it is
concluded that the sample fits the proposed distributional model.
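The following Python sketch shows how a Kolmogorov test against a fully specified Normal(400, 400) model (standard deviation 20) can be reproduced. The weights here are simulated rather than taken from the Steers file, so the D and p values are only illustrative.

import numpy as np
from scipy import stats

# Hypothetical weights tested against a fully specified Normal model with
# mean 400 and variance 400 (i.e. standard deviation 20), as in Example 4.
rng = np.random.default_rng(0)
weights = rng.normal(loc=400, scale=20, size=150)

D, p = stats.kstest(weights, 'norm', args=(400, np.sqrt(400)))
print(f"D = {D:.3f}, p-value = {p:.4f}")   # large p: no evidence against the model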

Two-sample inference
T-test for independent samples
Menu STATISTICS TWO-SAMPLE INFERENCE T-TEST allows testing a hypothesis
about the expectation of a random variable defined as a difference of sample means. It is
assumed that two independent samples are available, each one from a population or distribution.
The test can be seen as a tool for comparing the means (expectations) of two populations
(distributions), i.e.:

H0: E(X1) = E(X2) vs. H1: E(X1) ≠ E(X2)

When this test is invoked, in the variable selector of the T test for independent samples window
the response variable must be specified in the Variables subwindow and the variable that will be
used to identify the two samples in the Class variables subwindow. The classification
variable must allow the observations to be classified into two groups. For example, in the following
window it can be seen that Weight was selected in Variables and Month as Class variables.


If more than one column of the file is specified in the Class variables subwindow and/or
more than one column in the Variables subwindow, InfoStat shows the results of the T tests
associated with each classification criterion and/or each variable under study separately. The
T test for independent samples window that is displayed after accepting the variable selection
allows specifying the type of Test, the type of Comparisons, the information to be displayed,
and whether a test of homogeneity of variances is required.
The test can be two-tailed, one-tailed left or one-tailed right. If there are more than two groups
or samples, all the T tests for pairs of means can be obtained by selecting the option All pair-wise
contrasts. InfoStat also allows comparing the mean of one of the groups with the
mean of all the other groups by selecting the option Selection vs others; to indicate which group
is selected, double-click on the name of the group in the Groups subwindow. If there are more
than two samples and one of them is to be skipped, double-click on the name of the group in the
Groups subwindow, with All pair-wise contrasts enabled; this group is then not taken into
account in the analysis and is displayed in the Omitted groups subwindow. Regarding the
information to be displayed as a result, the Conf. Inter. field allows requesting the construction
of a confidence interval for the difference of population means with a confidence coefficient
determined by the user; the fields T, df and p, when activated, display the test statistic, the
degrees of freedom of its distribution and the p-value of the hypothesis test performed.
If the test of homogeneity of variances is required, InfoStat will select a T statistic for
heterogeneous or homogeneous variances according to the result of that test. The
user can specify the significance level to be used in the test of homogeneity of variances.
In the Results window, you can see the classification variables that determine each group
and the sample sizes in populations 1 and 2 (denoted by n(1) and n(2), respectively). Mean(1),
Mean(2), Var(1) and Var(2) are the means and variances of groups 1 and 2, respectively.
Example 5: In a study analyzing the evolution of stored tubers, two harvest seasons, April
and August, which determine different storage periods, were to be compared. The study
variable was the weight loss by dehydration (in grams). The Month file contains the
observations. The following table shows the results.
Table 13: Results of the T test for equality of means. File Month.
T test for independent samples

Class  Variable  Group 1  Group 2   n(1)  n(2)  Mean(1)  Mean(2)  pVarHom  T     p-value  Test
Month  Weight    {April}  {August}  10    10    40.89    26.65    0.6648   8.21  <0.0001  Two tails

The p-value < 0.0001 indicates that there are differences between the two periods in terms of
the expected weight loss. The sample means suggest a lower average weight loss for
the month of August.
Because the test of homogeneity of variances was requested, InfoStat contrasts the following
hypotheses: H0: σ1² = σ2² vs. H1: σ1² ≠ σ2². For this test the statistic used is

$$F = \frac{S_1^2}{S_2^2}$$

which under H0 is distributed as an F variable with (n1-1) and (n2-1) degrees of freedom. The
pVarHom column shows the p-value of the test of homogeneity of variances. In this example,
the null hypothesis of homogeneity of variances is not rejected (nominal significance level
α = 0.05).
Since the hypothesis test indicated homogeneous variances, the statistic T = 8.21 is
obtained from the following expression:

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}$$
The p-value is calculated from a T-Student distribution with (n1 + n2-2) degrees of freedom.
When the hypothesis of homogeneity of variances is rejected, the test is based on the
statistic:

$$T' = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_{\bar{X}_1-\bar{X}_2}}, \qquad
\text{where } S_{\bar{X}_1-\bar{X}_2} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}.$$

In this case the p value is calculated from a Student's t distribution whose degrees of freedom
are obtained from the following expression:

$$\nu = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}
{\dfrac{\left(S_1^2/n_1\right)^2}{n_1+1} + \dfrac{\left(S_2^2/n_2\right)^2}{n_2+1}} - 2$$
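The steps above (an F test of homogeneity of variances followed by a pooled or Welch-type T statistic) can be sketched in Python as follows. The data are hypothetical, not the actual Month file, and SciPy's Welch degrees of freedom formula differs slightly in detail from the expression given above.

import numpy as np
from scipy import stats

# Hypothetical weight-loss data for two harvest months (not the actual Month file).
april  = np.array([42.1, 39.5, 41.0, 40.2, 43.0, 38.9, 41.7, 40.5, 42.4, 39.6])
august = np.array([27.3, 25.8, 26.9, 28.1, 25.2, 27.7, 26.0, 28.4, 25.5, 26.6])

# F test of homogeneity of variances: F = S1^2/S2^2 with (n1-1, n2-1) d.f.
F = april.var(ddof=1) / august.var(ddof=1)
df1, df2 = len(april) - 1, len(august) - 1
p_var = 2 * min(stats.f.cdf(F, df1, df2), stats.f.sf(F, df1, df2))

# Choose the pooled or the Welch-type statistic according to that result.
t, p = stats.ttest_ind(april, august, equal_var=(p_var > 0.05))
print(f"pVarHom = {p_var:.4f}, T = {t:.2f}, p-value = {p:.4g}")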

Wilcoxon test (Mann-Whitney U)


Menu STATISTICS TWO-SAMPLE INFERENCE WILCOXON (MANN-WHITNEY U)
allows testing the hypothesis that two independent samples ({X1,...,Xn1} and {Y1,...,Yn2})
come from the same population, using the Wilcoxon statistic (Lehmann, 1975). This test is
equivalent to the Mann-Whitney U test for independent samples. Both are non-parametric
proposals based on the ranks of the original observations.
The hypothesis being tested is that the underlying distribution functions (F(x) and G(y)) have
the same position parameter. Under the alternative hypothesis there is a shift (Δ) of one
distribution with respect to the other. That is:

H0: F(x) = G(y)

and the possible alternatives are based on the model G(y) = F(x-Δ), where Δ is the shift
parameter (under the null hypothesis Δ = 0). The test is based on the statistic W, which is the
sum of the ranks in the sample of smaller size; the ranks are obtained from the combined data of
both samples. The W value in the results table corresponds to a standardized version of the W
statistic based on its asymptotic distribution. When the smaller of the sample sizes under
consideration is large enough, this statistic is distributed as a standard normal variable.
InfoStat handles ties by adjusting the variance of the statistic (Hollander and Wolfe, 1999).
For this test, the file must contain two columns, one indicating the values of the variable and
the other the classification criterion, as was shown for the T test. When you click OK, a screen
is displayed where the user must select the information to be included in the Results window.


The output window may present, for each of the samples considered (groups 1 and 2), the
sample size (n(1) and n(2)), the mean (Mean(1) and Mean(2)), the standard deviation (SD(1)
and SD(2)), the average ranks (R-mean(1) and R-mean(2)) and the original sample medians
(median(1) and median(2)), as well as the W statistic and the p-value.
InfoStat allows obtaining exact p-values (for this, select the Exact field); the p-value is then
calculated from the distribution of the statistic over all possible samples. When the Exact field
is not activated, the standardized version of the W statistic, from which the approximate test
(if there are no ties) for large samples is obtained, is:
$$W^* = \frac{W - E(W)}{S(W)}, \qquad \text{where } E(W) = \frac{n(1)\,(n(1)+n(2)+1)}{2}
\quad \text{and} \quad S(W) = \sqrt{\frac{n(1)\,n(2)\,(n(1)+n(2)+1)}{12}}$$

For the Month file, using the exact test, the following results are obtained:
Table 14: Results for the Wilcoxon test. File Month.
Wilcoxon test for independent samples

Class  Variable  Group 1  Group 2  n(1)  n(2)  Mean(1)  Mean(2)  SD(1)  SD(2)  W       p(2 tails)
Month  Weight    April    August   10    10    40.89    26.65    3.58   4.15   155.00  0.0002

Since p < 0.05, the null hypothesis is rejected: the distributions do not have the same position
parameter.
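A short Python sketch of this kind of comparison is shown below. The data are hypothetical (not the actual Month file), and SciPy reports the Mann-Whitney U statistic, which is related to, but not identical to, the W rank sum shown by InfoStat; the exact method requires a recent SciPy version.

from scipy import stats

# Hypothetical data for two independent groups (not the actual Month file).
april  = [42.1, 39.5, 41.0, 40.2, 43.0, 38.9, 41.7, 40.5, 42.4, 39.6]
august = [27.3, 25.8, 26.9, 28.1, 25.2, 27.7, 26.0, 28.4, 25.5, 26.6]

# Exact two-tailed Mann-Whitney test (equivalent to the Wilcoxon rank-sum test).
res = stats.mannwhitneyu(april, august, alternative='two-sided', method='exact')
print(f"U = {res.statistic:.1f}, p(2 tails) = {res.pvalue:.4f}")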

Wald-Wolfowitz test
Menu STATISTICS TWO-SAMPLE INFERENCE WALD-WOLFOWITZ TEST.
This non-parametric test can be applied to determine whether two independent samples come
from the same population, against the alternative that the two groups differ in some aspect,
either central tendency or variability. The statistic is based on the number of runs. When the
sample sizes tend to infinity, Wald and Wolfowitz showed that the standardized R statistic
tends to the standard normal distribution (Lehmann, 1975). InfoStat uses the normal
approximation to obtain p-values.

Van Der Waerden test (normal scores)


Menu STATISTICS TWO-SAMPLE INFERENCE VAN DER WAERDEN TEST
allows comparing the distributions of two continuous random variables. It is assumed that two
independent samples are available. The hypotheses are:

H0: F(x) = G(y) vs. H1: Δ ≠ 0

where Δ is the shift parameter. This is a competitor of the Wilcoxon test for testing the
equality of the position parameters of two distributions.

The Van der Waerden statistic is

$$c = \sum_{j=1}^{n} \Phi^{-1}\!\left(\frac{S_j}{N+1}\right)$$

where c is the sum, for j = 1,...,n, of the quantiles Φ⁻¹(Sj/(N+1)) of the standard normal
cumulative distribution (Φ), Sj is the rank of the j-th observation of the smaller sample in the
combined sample, n is the smaller of the sample sizes under consideration, and N is the total
number of observations.
When the smaller of the sample sizes under consideration is large enough, the VW statistic
(the standardization of c) is distributed as a standard normal variable (Hollander and Wolfe,
1999); InfoStat uses this normal approximation. The results window presents, for each of the
samples considered (groups 1 and 2), the sample size (n(1) and n(2)), the mean (Mean(1) and
Mean(2)), the standard deviation (SD(1) and SD(2)), the original sample medians (median(1)
and median(2)), and the c values (sum Zscore(1) and sum Zscore(2)). The VW statistic can
also be requested; it is calculated as:
$$VW = \frac{c}{S(c)}, \qquad \text{where } S(c) = \sqrt{\frac{n(1)\,n(2)}{N(N-1)}
\sum_{i=1}^{N} \left[\Phi^{-1}\!\left(\frac{i}{N+1}\right)\right]^2}$$

BellDoksum test (normal scores)


Menu STATISTICS TWO-SAMPLE INFERENCE BELL-DOKSUM TEST allows
comparing the distributions of two continuous random variables. It is assumed that two
independent samples are available. This is a non-parametric test. The hypotheses are:

H0: F(x) = G(y) vs. H1: Δ ≠ 0

where Δ is the shift parameter.
For this test the observed values are replaced by normal scores (Z): a random sample of size n
is drawn from the standard normal distribution, its values are sorted in ascending order, and
they are then assigned to the observed values according to their rank order. This is done for
each group independently. On the resulting standard scores (Z), InfoStat performs a T-test for
two independent samples, using a normal approximation for large samples.
The statistic, with an approximately standard normal distribution, is calculated as:

$$Z = \frac{\bar{T}_1 - \bar{T}_2}{\sqrt{\dfrac{1}{n(1)} + \dfrac{1}{n(2)}}}$$

where T̄1 and T̄2 represent the averages of the normal scores in groups 1 and 2, respectively.



Example 6: An experiment was conducted in which the reaction time (in minutes) to a drug
was recorded for two groups of guinea pigs: Males (1) and Females (2) (Bell-Doksum file).
Table 15: Results for the Bell-Doksum test. File Bell-Doksum.
Bell-Doksum (Normal scores)

Class  Variable  Group 1  Group 2  n(1)  n(2)  Mean(1)  Mean(2)  SD(1)  SD(2)  T      p(2 tails)
Sex    Minutes   1        2        5     5     26.40    44.00    5.68   7.62   -4.51  0.0020

Kolmogorov-Smirnov test
Menu STATISTICS TWO-SAMPLE INFERENCE KOLMOGOROV-SMIRNOV
allows testing whether two samples come from the same distribution. It is assumed that two
independent samples are available. The hypotheses are:
H0: F(x) = G(x) vs. H1: F(x) ≠ G(x), for at least one x.
The test is sensitive to any discrepancy between the distributions (dispersion, position,
symmetry, etc.). The statistic is based on the maximum difference between the two
distributions. The Kolmogorov-Smirnov KS statistic is:

$$KS = \frac{n(1)\,n(2)}{d} \max_{-\infty < t < \infty}
\left\{ \left| F_{n(1)}(t) - G_{n(2)}(t) \right| \right\}$$

where n(1) and n(2) are the sample sizes and d is the greatest common divisor of n(1) and
n(2). InfoStat uses the normal approximation based on the asymptotic distribution of KS,
properly normalized (Hollander and Wolfe, 1999). For each sample, the mean, median and
standard deviation (SD) can also be requested.

Irwin-Fisher test
Menu STATISTICS TWO-SAMPLE INFERENCE IRWIN-FISHER allows comparing
two random samples that come from independent populations. It is a procedure for
dichotomous variables based on the hypergeometric distribution. This test allows contrasting
the hypothesis of equal success proportions, p1 and p2, in both populations:

H0: p1 = p2 = p0 vs. H1: p1 ≠ p2

where p0 is the proposed value for the parameter that is assumed common to both
distributions.
Example 7: The response of students to a method for improving grammar in language
learning is being studied. One group was randomly assigned to the experimental treatment (A)
and the other to the control treatment (B). Subsequently, all students underwent the same
evaluation, which consisted of recording the time (in min.) that each student took to read a
paragraph and answer the questions associated with it. The results for the two study groups
make up the Language file. In the evaluation, each student is above or below a given level,
such as the median of the pooled sample, which in this case is 3.2.

The statistic is based on the estimated difference between the proportions of observations
above (or below) that level for the experimental and control treatments:

$$\hat{D} = \hat{p}_1 - \hat{p}_2, \qquad \text{with } \hat{p}_1 = \frac{t_1}{n(1)}
\text{ and } \hat{p}_2 = \frac{t_2}{n(2)}$$

where n(1) and n(2) are the sample sizes, and t1 and t2 represent the number of the n(1) and
n(2) observations that are above (or below) the level. InfoStat calculates exact p-values for this
test when the sample size does not exceed 50. For larger sample sizes InfoStat uses the normal
approximation:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_0(1-p_0)}{n(1)} + \dfrac{p_0(1-p_0)}{n(2)}}}$$

InfoStat can automatically construct dichotomous variables from a quantitative variable
selected by the user, based on the user's definition of the criterion used to decide whether a
value should be considered a success (1) or a failure (0). Comparing each value with the mean,
the median or an arbitrary value (values larger than, smaller than or equal to them) may be
indicated as the dichotomizing criterion. The test statistic is calculated as a function of the
dichotomous variable created in this way.
For the example presented, choosing the option of considering as successes the values greater
than the median, the following table was obtained:
Table 16: Irwin-Fisher test to compare proportions. File Language.
Irwin-Fisher Test
Consider as success values greater than the median

Class  Variable  Group 1  Group 2  n(1)  n(2)  p1    p2    p1-p2  p(2 tails)
Set    Minutes   A        B        8     11    1.00  0.09  0.91   0.0001

Median test

Menu STATISTICS TWO-SAMPLE INFERENCE MEDIAN TEST allows a test of
homogeneity of proportions that is a special case of the Irwin-Fisher test, in which the median is
used as the criterion for deciding whether an observation should be considered a success or a
failure. It is assumed that each sample consists of independent and identically distributed
observations from continuous distributions.
For each sample, P(X1>Med) and P(X2>Med) can be obtained, the proportions of observations
that are above the median calculated from the pooled sample (Med). In addition, InfoStat
provides the means (Mean(1) and Mean(2)), standard deviations (SD(1) and SD(2)) and the
medians (median(1) and median(2)).
The hypotheses are:

H0: p1 = p2 = p0 vs. H1: p1 ≠ p2

where p0 is the hypothesized value for the proportion of cases above the median of the
pooled sample, assumed to be common to both samples.
For large samples, InfoStat uses the normal approximation presented for the Irwin-Fisher test.
Following the example of the Language file, if the median test is applied, the following results
are obtained:
Table 17: Median test for two samples. File Language.
Two samples median test

Class  Variable  Group 1  Group 2  n(1)  n(2)  Med    P(X1>Med)  P(X2>Med)  p(2 tails)
Set    Minutes   A        B        8     11    18.00  1.00       0.09       0.0001

Differences in proportions test


Menu STATISTICS TWO-SAMPLE INFERENCE DIFFERENCES IN
PROPORTIONS allows testing the hypothesis of equal proportions of success in two populations:

H0: p1 = p2 vs. H1: p1 ≠ p2

The reported p-values are obtained from the exact distribution of the Fisher statistic (Marascuilo,
1977).
Example 8: A survey company believes that the proportion of voters (in the next election) in
the North is different from the proportion of voters in the South. A survey of voters in both
areas was made, with the following results:

Zone   Sample size   Number of persons that will vote in the elections
North  1400          1200
South  1800          1400

In this case no InfoStat data file is needed: simply complete the edit fields with the required
information regarding the two available sample sizes and the number of successes observed in
each of them. By activating the Calculate button, you obtain the difference of proportions and
the p-value, as shown on the screen.
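The calculation in Example 8 can be sketched in Python as follows. InfoStat reports p-values from the exact distribution of the Fisher statistic; the sketch below uses SciPy's Fisher exact test on the same 2x2 table, which may not reproduce InfoStat's output exactly.

from scipy import stats

n1, x1 = 1400, 1200        # North: sample size and number who will vote
n2, x2 = 1800, 1400        # South

p1, p2 = x1 / n1, x2 / n2
table = [[x1, n1 - x1], [x2, n2 - x2]]                    # successes / failures per zone
oddsratio, p = stats.fisher_exact(table, alternative='two-sided')
print(f"p1 - p2 = {p1 - p2:.4f}, p-value = {p:.4g}")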


Paired T-test
Menu STATISTICS TWO-SAMPLE INFERENCE PAIRED T-TEST. This allows testing
the hypothesis of equal means when the observations from the two distributions to be compared
are taken in pairs. That is, there is a sample of n pairs of observations, each member of a pair
coming from one of the two distributions. The test is based on the distribution of the variable
defined as the difference between the paired observations, d.
If the null hypothesis to be tested is H0: μ1 - μ2 = 0, this implies μd = 0, where μd is the
expectation of the difference variable. To test this hypothesis the statistic used is:

$$T = \frac{\bar{d}}{S_d/\sqrt{n}} \sim T_{(n-1)}$$

where n is the number of pairs, $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$ and
$S_d = \sqrt{\dfrac{\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}{n-1}}$, with di = the difference
between the observations recorded in the i-th sampling unit.


Note: this test in InfoStat requires a file with two columns: one for observations from the distribution
1 and the other for distribution 2.

Example 9: To study the effect of pollination on the average weight of the seeds obtained, an
experiment was conducted on 10 plants. Half of each plant was pollinated and the other half
was not. The seeds from each half were weighed separately, so a pair of observations was
recorded for each plant. The Pollination file contains the values recorded in the study, as they
should be entered into InfoStat.
Table 18: Paired t-Test results. File Pollination.
T test (paired samples)

Obs(1)      Obs(2)         N   mean(dif)  SD(dif)  T     p(Two tails)
Pollinated  No_Pollinated  10  0.45       0.17     8.42  <0.0001

In the output, mean(dif) and SD(dif) correspond to the mean and standard deviation of the
difference variable. The value p < 0.0001 suggests the rejection of the hypothesis H0: μd = 0,
i.e. there are significant differences between the weights of seeds from pollinated and
non-pollinated flowers: the average weight difference between pollinated and non-pollinated
flowers is significantly different from zero.
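A minimal Python sketch of a paired t test is shown below; the seed weights are hypothetical, not the actual Pollination file, so the T and p values are illustrative only.

from scipy import stats

# Hypothetical seed weights per plant (not the actual Pollination file).
pollinated     = [2.1, 1.9, 2.4, 2.0, 2.3, 1.8, 2.2, 2.5, 2.0, 2.1]
not_pollinated = [1.6, 1.5, 1.9, 1.6, 1.8, 1.4, 1.7, 2.0, 1.6, 1.7]

t, p = stats.ttest_rel(pollinated, not_pollinated)        # paired t test on the differences
print(f"T = {t:.2f}, p(Two tails) = {p:.4g}")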


Wilcoxon test (paired observations)


Menu STATISTICS TWO-SAMPLE INFERENCE WILCOXON TEST allows a test for
comparing two distributions that may differ in their position parameter, when two samples of
paired observations are available. If F(.) and G(.) represent the distribution functions of X and
Y respectively, the Wilcoxon test contrasts H0: F(x) = G(y) vs. H1: F(x) = G(y-Δ) with Δ ≠ 0,
where Δ represents the shift parameter. That is, since values of X and Y are recorded on every
sampling unit (e.g., reaction before (X) and after (Y) a treatment), the statistic tests the
hypothesis that the distributions of X and Y are equal except possibly for a change in the
position parameter. The test uses the magnitude and sign of the differences between pairs of
observations. Given a set of paired observations (Xi, Yi), i = 1,...,n, the procedure calculates
Di = (Xi-Yi), takes the absolute values of the differences, and applies the rank transformation
to them:

Ri = rank|Xi-Yi| = position of |Di| in the ordered sample

The ranks are then associated with the signs of the original differences. This test assumes that
the distribution of the Di is symmetric and that the Di are mutually independent with the same
expectation.
The Wilcoxon test statistic is the rank sum for the Di > 0 and is denoted as T(+) = sum R(+).
InfoStat also provides the expectation and variance of the positive ranks under H0. The p-value
is obtained by normal approximation.
Example 10: Two educational projects to be implemented in two schools (A and B) are to be
compared, using the results of a final evaluation of a group of individuals who received
instruction through these two methods. From a population of interest (fourth-year secondary
students), 14 students were chosen at random and grouped into 7 pairs on the basis of their
average grades. The members of each pair were randomly assigned to one of the teaching
methods. After a training period of one year they were assessed. The data are presented in the
Score file.
Table 19: Wilcoxon paired test. File Score.
Wilcoxon test (paired samples)
P-value estimated by Bootstrap

Obs(1)    Obs(2)    N  Sum(R+)  E(R+)  Var(R+)  Bt p(2 tails)
School A  School B  7  27.00    14.00  34.63    0.0432

For a significance level α = 0.05, there are statistically significant differences between the two
teaching methods.

Sign test
Menu STATISTICS TWO-SAMPLE INFERENCE SIGN TEST allows a test of the
equality of the expectations of two distributions in situations where two samples of paired
observations are available. Unlike the Wilcoxon test, it works only with the sign of the
differences between the n pairs of observations (X, Y):

Di = Xi - Yi, with i = 1,...,n

Whenever a difference is zero, it is considered a tie and is not included in the analysis. The
null hypothesis is H0: P(positive signs) = P(negative signs) = 1/2.
Following the example of the Score file, the Di are {5,4,4,2,5,0,2}, with mean 3.14, and the
corresponding signs are {+,+,+,+,+,0,+}. The results of this test for the data in the example are
presented below:
Table 20: Sign test for non-independent samples. File Score.
Sign test

Obs(1)    Obs(2)    N  N(+)  N(-)  mean(dif)  p(2 tails)
School A  School B  7  6     0     3.14       0.0313

N(+) represents the number of positive differences, N(-) the number of negative differences,
mean(diff) and SD(dif) are respectively the mean and standard deviation of the difference
variable. The results suggest that there are differences between the two methods (p=0.0313).
Note: To be able to perform any test involving paired observations, InfoStat requires a file with two
columns, one for each element of the pair of observations. The file will have as many rows as pairs of
observations have been recorded.
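The sign test reduces to a binomial calculation, which can be sketched in Python as follows using the differences listed in the example above (scipy.stats.binomtest is available in recent SciPy versions).

from scipy import stats

# Differences between School A and School B scores, taken from the example above.
d = [5, 4, 4, 2, 5, 0, 2]
nonzero = [x for x in d if x != 0]                        # ties (zero differences) are excluded
n_pos = sum(1 for x in nonzero if x > 0)

# Under H0 the number of positive signs is Binomial(n, 1/2).
res = stats.binomtest(n_pos, n=len(nonzero), p=0.5, alternative='two-sided')
print(f"N(+) = {n_pos}, p(2 tails) = {res.pvalue:.4f}")   # approximately 0.0313, as in Table 20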

F-test for two variances


Menu STATISTICS TWO-SAMPLE INFERENCE F-TEST FOR TWO VARIANCES
allows a test for contrasting the equality of the variances of two distributions.
InfoStat contrasts the hypotheses H0: σ1² = σ2² vs. H1: σ1² ≠ σ2² using the statistic

$$F = \frac{S_1^2}{S_2^2}$$

which under H0 is distributed as an F variable with (n1-1) and (n2-1) degrees of freedom.
With the data from Example 5 (Month file), an F test of homogeneity of variances was
performed, and the results are shown in the following table:

Table 21: Homogeneity of variances F test. File Month.
F-test for two variances

Variable  Group(1)  Group(2)  n(1)  n(2)  Var(1)  Var(2)  F     p-value  Test
Weight    {April}   {August}  10    10    12.81   17.25   0.74  0.6648   Two tails

The observed p-value (p=0.6648) indicates that variances are homogeneous.

Analysis of variance
Analysis of Variance (ANOVA) allows testing hypotheses about the position parameters
(expectations) of two or more distributions. The hypothesis being tested is usually
established with respect to the means of the populations under study or of each of the
treatments evaluated in an experiment:

H0: μ1 = μ2 = ... = μa, with i = 1,...,a

where a = number of populations or treatments.


The ANOVA is a procedure that decomposes the total variability in the sample (total sum of
squares of the observations) into components (sums of squares) associated with each
recognized source of variation (Nelder, 1994; Searle, 1971, 1987).
Comparative experiments are usually performed by applying various treatments to a set of
experimental units in order to evaluate and compare the responses obtained under each treatment. In
this case it is desirable to manage the resources efficiently so as to increase the precision of the
estimates of the average treatment responses and of the comparisons between them. Treatments
are the actions applied to the experimental units that are being compared; they may be
represented by the levels of a factor or by a combination of the levels of two or more
factors (factorial treatment structure).
One of the main objectives in planning an experiment following an experimental
design is to reduce the error, or variability, between experimental units receiving the same
treatment, in order to increase the precision and sensitivity of the inferences, such as those
related to the comparison of treatment effects.
The experimental design is a strategy for combining the treatment structure (factors of
interest) with the structure of the experimental units (plots, individuals, pots, etc.), so that
alterations in the responses, at least in one subgroup of experimental units, can be attributed
only to the action of the treatments, apart from random variation. In this way it is possible to
compare treatment means, or linear combinations of treatment means, with as little
"noise" as possible.
To perform an ANOVA in InfoStat, the variables of the file that represent one or more
dependent variables, the classification variables, and the covariates (if they exist) must be
indicated. The dependent variable is the variable to be examined (response variable), such as
crop yield. If more than one dependent variable is specified, InfoStat will carry out the analysis
of variance for each dependent variable separately.
The classification variables are the variables involved on the right-hand side of the equation of
the ANOVA statistical model and represent factors or sources of variation used to separate or
classify the observations of the file into groups. Normally there are factors related to the
treatment structure of the experiment and factors related to the structure of the experimental
units; both must be indicated as classification variables.
The variables listed as covariates (or concomitant variables) represent continuous random
variables whose value varies with each experimental unit and which are possibly linearly
related to the response variable. In situations where the presence of a concomitant variable
has been identified, InfoStat can perform an analysis of covariance, i.e., adjust for or remove
the variability in the dependent variable due to the covariate before testing differences between
treatments.
Below is the Analysis of variance window that lets you choose the set of variables needed to
perform this analysis.
In this window, yield is the Dependent variable and cultivar is the Classification variable.
This example could come from an experiment to compare the yield of two or more cultivars in
which no covariate has been recorded.
InfoStat uses the least squares method to fit the general linear model, allowing the
specification of more than one classification criterion and their interactions (crossed factors)
as well as nesting structures (nested factors). With this type of model, experiments with a
single factor or with multiple factors or sources of variation can be analyzed (Cochran and
Cox, 1957; Anderson and McLean, 1974; Ostle, 1977; Hinkelmann and Kempthorne, 1994;
Di Rienzo et al., 2001).
When the normal equations are set up to obtain estimates of the model parameters, linear
dependencies may be found, so there is no unique solution to the system of equations (Graybill,
1961; Hocking, 1996). To obtain a solution, InfoStat uses the usual restrictions: the sum of the
effects of the different levels of a factor is equal to zero. This kind of restriction on the effects
of the factors of the model has a simple interpretation if the effects of the factor levels are
defined as deviations about the mean. By imposing these restrictions, solutions for the fixed
parameters of the model are obtained. These solutions are used to obtain the predicted value
for each observation. InfoStat also calculates the residual for each observation as the
difference between the observed value and the value predicted by the model. The predicted
values and residuals of each observation can be attached to the active data table.
In this menu InfoStat works with fixed-effects models. However, it allows the specification of
individual error terms for testing hypotheses about model terms. Thus, the user can handle
fixed, random or mixed effects models if the expectations of the mean squares for each term in
the model are known.
The sums of squares presented in the ANOVA tables are, by default, Type III sums of squares.
These sums of squares are called partial and reflect the contribution of each term given that
all the other terms are also present in the model. There is an option to obtain Type I sums of
squares. Type I sums of squares are called sequential and depend on the order in which the
terms of the model are declared, because each one represents the reduction in the error sum of
squares obtained by adding that term after the terms listed before it. Type I sums of squares
are used when the order of the model terms reflects a hierarchy that is useful for
interpretation, as in completely nested models or models with polynomial terms.

Model
The data to be analyzed may come from trials conducted under different experimental designs
(completely randomized, randomized complete block, balanced incomplete block, Latin square,
crossover, split plot, nested, etc.) (Snedecor, 1956; Ostle, 1977; Di Rienzo et al., 2001).
The differences in the analysis of data from different designs are introduced when specifying
the model for the observed variable. InfoStat requires the linear model used in the analysis
to be identified in the Analysis of variance window, Model flap. The variables declared as
Class variables and Covariates appear automatically in the subwindow Specification of model
terms, where the user builds the particular model to be fitted. The terms μ (overall mean) and
ε (random error), present in all ANOVA models, need not be specified.

Completely random design


The assumption is that the experimental units are homogeneous, i.e. they have no structure.
Treatments are assigned to the experimental units completely at random. The data file must
contain at least two columns, one identifying the treatment (classification variable) and
another for the response variable (dependent variable). The number of repetitions can vary
from one treatment to another. The linear model for the observation of treatment i in plot j,
Yij, fitted by InfoStat is:
Yij = μ + τi + εij
where:
Yij is the observation of treatment i in plot j
τi is the effect of treatment i
εij is the random error associated with the observation Yij
Usually it is assumed that the error term is normally distributed with mean zero and constant
variance for each observation.
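As an external cross-check, the same one-way model can also be fitted outside InfoStat. The
following is a minimal sketch in Python using statsmodels (not part of InfoStat); the column
names Trt and Y and the data values are hypothetical and only illustrate the structure
described above.

# Minimal one-way ANOVA sketch (hypothetical data; not InfoStat output)
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "Trt": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,          # treatment labels (classification variable)
    "Y":   [12.1, 11.8, 12.5, 12.0,                     # response values (dependent variable)
            13.4, 13.9, 13.1, 13.6,
            11.2, 10.9, 11.5, 11.0],
})

model = smf.ols("Y ~ C(Trt)", data=df).fit()            # fits Yij = mu + tau_i + eps_ij by least squares
print(sm.stats.anova_lm(model, typ=2))                  # ANOVA table (Types I, II and III coincide for one factor)
print(model.fittedvalues.head())                        # predicted values
print(model.resid.head())                               # residuals: observed minus predicted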
Example 2: To compare 4 corn cultivars (treatments), an experiment was carried out under a
completely randomized design with 10 repetitions or plots per treatment. The response variable
is yield. The data are in the Hybrids file.
The Hybrids data file contains two columns, one identifying the treatment (hybrid) and another
for the observed response (yield). To run the analysis, select the menu STATISTICS, ANALYSIS
OF VARIANCE. If, in the variable selector window of Analysis of variance, "hybrid" was declared
as Class variable and "yield" as Dependent variable, the next Analysis of variance window will
indicate that "hybrid" has been selected as the only classification variable, and this variable
will also appear in the subwindow Specification of model terms, which corresponds to a one-way
classification analysis.
In this window you can also specify whether to save the residuals, the values predicted by the
model, the studentized residuals and/or the absolute residuals, which are useful for the
subsequent evaluation of the fit of the specified model (see Assumptions). When these measures
are requested, columns containing the requested information are automatically created in the
data file. The Overwrite field must be activated when the file already contains columns with
residuals and predicted values from a previous run that you do not wish to retain.
In addition to the Model flap, the analysis of variance window presents the Comparisons and
Contrasts flaps. These allow the user to select the method of multiple comparisons of means to
be performed a posteriori of the analysis of variance (see Multiple comparisons) and to
establish ad hoc contrasts between means of different levels of classification (see Contrasts).
In the window that appears when the Comparisons flap is selected, the field Means to show by
is used to choose the factor whose means you want to compare (in this example the aim is to
compare the means of the hybrids in pairs, so "hybrid" must be selected). Because this is a
one-way classification design, either of the available options, "hybrid" and all means,
produces the desired comparisons. If there are several factors, the available options include
the name of each factor and all means. Selecting a particular factor yields pairwise
comparisons of the means corresponding to the levels of the selected factor. Selecting all
means, InfoStat reports the comparisons of means for all the treatments defined by the
combinations of the levels of the factors involved.
For example, in this case the multiple comparison method proposed by Fisher (Fisher LSD) was
selected to compare the means of the hybrids in pairs. In this window it was also specified
that the results be presented in List order; in this case InfoStat reports the means sorted
from lowest to highest and accompanied by letters, so that means sharing the same letter show
no statistically significant differences at the significance level indicated in the
corresponding field (0.05 by default, but 0.01 or Other can be specified). The Other option is
not available for the DGC, Tukey, Duncan and SNK tests.

Table 22: Analysis of variance output for a completely randomized design. File Hybrids.
Analysis of variance

Variable   N    R²      Adj R²   CV
Yield      40   0.321   0.265    23.726

Analysis of variance table (Partial SS)

S.V.        SS          df    MS         F       p-value
Model     10026.830      3   3342.277   5.677    0.0027
Hybrid    10026.830      3   3342.277   5.677    0.0027
Error     21194.845     36    588.746
Total     31221.676     39

Test: Fisher LSD   Alpha:=0.05   LSD:=22.00731
Error: 588.7457   df: 36

Hybrid   Means     n    S.E.
2         76.680   10   7.673   A
4        105.437   10   7.673       B
1        106.901   10   7.673       B
3        120.057   10   7.673       B
Different letters indicate significant difference between location parameters (p<= 0.05)

For this example, the ANOVA p-value (p=0.0027) suggests rejecting the hypothesis of equal
treatment means, i.e., there are significant differences among hybrids for the variable yield.
According to the Fisher LSD test, hybrid 2 differs significantly from the others; since hybrids
4, 1 and 3 do not differ among themselves and have the highest means, any of them could be
chosen.
Verification of the assumptions made about the error term and comparison of treatment
means usually accompany this type of output (see assumptions of ANOVA, multiple
comparisons and contrasts).
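The LSD threshold reported above can be reproduced by hand from the error mean square. Below
is a small sketch (Python with scipy, not InfoStat), assuming equal group sizes n = 10 and the
MSE and error degrees of freedom shown in the output.

# Reproducing the Fisher LSD threshold from the output above
from math import sqrt
from scipy import stats

mse, df_error, n, alpha = 588.7457, 36, 10, 0.05     # values taken from the ANOVA output above
t_crit = stats.t.ppf(1 - alpha / 2, df_error)        # two-sided critical t value
lsd = t_crit * sqrt(2 * mse / n)                     # least significant difference for a pair of means
print(round(lsd, 3))                                 # approximately 22.007, matching LSD:=22.00731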

Block design
When there is variability between experimental units, groups of homogeneous experimental units
can be formed as blocks to implement the experimental strategy known as Block Design. The term
block indicates that the experimental units within each block or group should be similar to
each other (homogeneity within blocks) and the blocks should differ from each other
(heterogeneity between blocks). That is, the blocking or grouping of the experimental material
should be such that the experimental units within a block are as homogeneous as possible, and
the blocks should be formed so that differences between experimental units are explained, to
the greatest possible extent, by differences between blocks. When the design has been conducted
in blocks, the model for each observation must include a term representing the effect of the
block to which the observation belongs. This makes it possible to remove, from the comparisons
between units receiving different treatments, the variation due to the structure present among
plots (blocks).
If each block has as many experimental units as treatments and all treatments are randomly
assigned within each block, the design is called a Randomized Complete Block Design (RCBD).
The design is said to be in complete blocks because every block contains all the treatments,
and randomized within each block because the treatments are assigned to the plots at random:
all the plots of the same block have the same probability of receiving any treatment. The
variation between blocks does not affect the differences between treatment means because each
treatment appears the same number of times in each block. This design allows greater precision
than the completely randomized design when its use is justified by the structure of the plots.
The following linear model can be postulated to explain the variation of the response in block
j receiving treatment i, obtained in a block design with one treatment factor:
Yij = μ + τi + βj + εij,   with i=1,...,a
where μ is the general mean, τi the effect of the i-th treatment, βj the effect of the j-th
block (j=1,...,b) and εij is the random error associated with observation Yij. Usually the
error terms are assumed to be normally distributed with zero expectation and common variance
σ². Another assumption that accompanies the model specification for a block design concerns
the additivity (no interaction) of block and treatment effects.
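To illustrate the additive structure of this model, here is a minimal sketch in Python using
statsmodels (outside InfoStat); the column names Trt, Block and Y and the synthetic data are
hypothetical.

# Additive RCBD model sketch (hypothetical column names and synthetic data)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = [(t, b) for t in ["T1", "T2", "T3"] for b in ["B1", "B2", "B3", "B4"]]  # one plot per treatment in every block
df = pd.DataFrame(rows, columns=["Trt", "Block"])
df["Y"] = 50 + rng.normal(0, 2, len(df))                  # synthetic yields, for illustration only

model = smf.ols("Y ~ C(Trt) + C(Block)", data=df).fit()   # Yij = mu + tau_i + beta_j + eps_ij (no interaction term)
print(sm.stats.anova_lm(model, typ=3))                    # partial sums of squares, as reported by default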
Example 3: A trial was conducted to assess the yield in kg of dry matter per hectare of forage
with different inputs of N2 in the form of urea. The urea doses tested were 0 (control), 75,
150, 225 and 300 kg/ha. The trial was conducted in different areas where, for soil and climatic
reasons, yields could differ. The areas in this case acted as blocks. The field design is
illustrated in Figure 1. The data are in the Block file.
Block I:    225   300    75   150
Block II:   300   150    75   225
Block III:   75   300   225   150
Block IV:   225   150    75   300

Figure 1: Layout of a randomized complete block design. File Block.


The data file for this analysis must contain at least three columns: one identifying the
treatment (levels of urea), another for the blocks and another for the observed response
(dependent variable), in this case the yield. For this analysis, select the menu STATISTICS,
ANALYSIS OF VARIANCE. If, in the Analysis of variance variable selector window, "treatment"
and "block" were declared as Class variables and "yield" as Dependent variable, the next
Analysis of variance window will indicate that the variables "treatment" and "block" have been
selected as classification variables and their names will appear in the subwindow Specification
of model terms.
Because more than one term has been declared in the model, the Add interactions button appears
automatically. For a block design with one treatment factor, such as this example, this button
should not be activated, because the assumption of block-treatment additivity precisely states
the absence of interaction between block and treatment effects.
Go opens the Results window containing the following information:


Table 23: Analysis of variance output for a randomized complete block design. File Block.
Analysis of variance

Variable   N    R²     Adj R²   CV
Yield      20   0.94   0.90     5.83

Analysis of variance table (Partial SS)

S.V.           SS           df      MS          F       p-value
Model       4494763.30       7    642109.04    24.88    <0.0001
Block        203319.00       3     67773.00     2.63     0.0983
Treatment   4291444.30       4   1072861.08    41.57    <0.0001
Error        309716.50      12     25809.71
Total       4804479.80      19

The observed p-value for treatments, p<0.0001, is less than the nominal alpha level (say
α=0.05), indicating that the null hypothesis is rejected: there are differences among treatment
means.
Verification of the assumptions made about the error term and comparison of treatment
means usually accompany this type of output (see assumptions of ANOVA, multiple
comparisons and contrasts).
In many situations it is not possible to assign every treatment to every block. When only a
subset of the treatments is present in each block, the design is called an Incomplete Block
Design. InfoStat also fits ANOVA models for incomplete block designs. In this case the design
should be balanced, i.e. each treatment must appear, in at least one block, together with each
of the other treatments. This ensures that treatment means, and differences between treatment
means, are estimated with the same standard error. The model to be specified for an incomplete
block design is the same as for an RCBD. When performing the ANOVA, InfoStat adjusts the
treatment sum of squares for blocks in order to remove the block effect from the treatment
means.

Latin square design


In many situations, the experimental units can be grouped according to more than one factor or
source of variation that is independent of the treatments. The Latin square design is used
when the plot structure involves two grouping factors, commonly called row and column factors,
in recognition of sources of systematic variation between experimental units. In a Latin
square, each treatment appears once in each row and once in each column. Thus, if a treatments
are tested in a Latin square, a² plots are arranged in a square with a rows and a columns, and
a plots are assigned to each treatment so that each row and each column contains exactly one
repetition of each treatment, as shown in the figure below:

Figure 2: Layout of a Latin square design with three treatments, A, B and C.


This design requires a fixed number of repetitions and when the number of treatments is
large, the entire experiment can be unmanageable. The total number of plots is equal to the
square of the number of treatments.
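As an illustration of this property, the following small sketch (plain Python, independent of
InfoStat) builds an a x a arrangement by cyclic shifts, so that each treatment appears exactly
once per row and once per column; the treatment labels are hypothetical.

# Cyclic a x a Latin square layout: each treatment appears once per row and once per column
a = 4
treatments = [f"T{i + 1}" for i in range(a)]
square = [[treatments[(r + c) % a] for c in range(a)] for r in range(a)]  # cyclic shift by row
for row in square:
    print(" ".join(row))
# Each of the a treatments occupies a plots in the a^2-plot square, one in every row and column.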
The linear model of the ANOVA for an experiment with a Latin square design is:
Yijk = μ + τi + γj + ρk + εijk,   with i, j, k = 1,...,a
where Yijk is the observed response of the i-th treatment in the j-th column and k-th row, and
εijk is the random error term for the observation of the i-th treatment in the j-th column and
k-th row. In this model the parameters γj and ρk model the effects of the factors associated
with variation in the direction of the columns and of the rows, respectively. The error terms
are usually assumed to be normally distributed with zero expectation and common variance σ².
Another assumption that accompanies the model specification for a Latin square design concerns
the additivity (no interaction) of the row and column effects with the treatments.
Example 13: A trial was conducted to assess the yield, in kg of dry matter per hectare of
forage, with different inputs of N2 in the form of urea. The urea doses tested were 0 (control),
150 and 300 kg/ha. The trial was conducted in a field with a forest windbreak to the north, so
the light received by the plots varied from north to south. In addition, the lot had a
significant slope from west to east. A Latin square design was used: variation of the plots in
the north-south direction was regarded as variation due to the column factor (light), and
variation in the west-east direction was associated with the row factor (slope). The data are
in the LatinSquare file.
The data file contains four columns: one identifying the treatment (levels of urea), another
for the row factor (slope), another for the column factor (light) and another for the observed
response (yield). To run the analysis select the menu STATISTICS, ANALYSIS OF VARIANCE. If,
in the Analysis of variance variable selector window, "treatment", "row" and "column" were
declared as Class variables and "yield" as Dependent variable, the next Analysis of variance
window will indicate that the variables "treatment", "row" and "column" have been selected as
classification variables and will display them in the subwindow Specification of model terms.
Go opens the Results window containing the following information:
Table 24: Analysis of variance output for a Latin square design. File LatinSquare.
Analysis of variance

Variable   N   R²     Adj R²   CV
Yield      9   1.00   0.99     1.36

Cell-unbalanced data
To get another SS decomposition define appropriate contrasts

Analysis of variance table (Sequential SS)

S.V.          SS        df    MS       F        p-value
Model       2698.00      6   449.67   161.88    0.0062
Row           28.22      2    14.11     5.08    0.1645
Column       754.89      2   377.44   135.88    0.0073
Treatment   1914.89      2   957.44   344.68    0.0029
Error          5.56      2     2.78
Total       2703.56      8

The p-value for treatments, p=0.0029, is less than the nominal significance level (α=0.05),
meaning that the F statistic calculated from the experiment is greater than the value expected
under the hypothesis of equal treatment effects (the 0.95 quantile of the F distribution with
2 and 2 degrees of freedom). It is therefore concluded, at a significance level of 0.05, that
there are differences in yield (kg of dry matter produced by the forage) among the fertilization
levels with urea.
If there are three factors to be controlled (experimental unit structure), i.e. in addition to
the row and column effects there is another effect related to the structure of the plots, the
resulting design is usually called a Graeco-Latin square. There are other generalizations of
this type of experiment when the number of factors to control is greater than three, and the
associated models can also be fitted in InfoStat because the user can add as many classification
criteria as required.
Verification of the assumptions made about the error term and comparison of treatment
means usually accompany this type of output (see assumptions of ANOVA, multiple
comparisons and contrasts).

Designs with factorial structure of treatments


These designs are used to study the effects of two or more treatment factors and, generally,
their interactions. In the previous examples the treatments were defined in relation to the
levels of a single factor of interest (one-way treatment structure). When the treatments are
defined by combining the levels of two or more factors of interest, the experimental design is
said to involve a factorial structure of treatments. The factorial structure of treatments can
be combined with different types of plot structure (completely randomized, blocks, etc.) to
generate various experimental designs. InfoStat allows models to be postulated in which the
factorial structure of treatments appears in the context of a completely randomized design, a
block design, a Latin square design, etc. When specifying the model, the parameters referring
to the treatment effects (which arise from the combination of two or more factors) should be
decomposed into a set of parameters that account for each of the factors involved in the
definition of a treatment. For example, if the treatment is defined by the combination of
levels of the factor dose and levels of the factor drug, dose and drug are specified separately
in the model. All possible interactions between factors can be added as model terms. In each
case, it must be judged whether all, or only some, of the possible interactions are required in
the model. If, for example, a trial with two treatment factors is carried out in randomized
complete blocks, it is not necessary to add the interactions of block with each of the two
treatment factors.
InfoStat fits models for complete factorial experiments, in which all possible combinations of
factor levels (treatments) are studied in each repetition of the experiment. Additive factorial
models are those in which the terms modeling the interaction are absent. Additive models are
used to study the main effects of the factors involved in a process when it is known that the
factors do not interact with each other. The main effect of a factor is defined as the average
change in response produced between any pair of levels of that factor.
To illustrate this case, a 2x2 factorial experiment (two factors with two levels each) in which
the interaction is absent is presented, arranged as a completely randomized design. The factors
are designated A and B and their levels A1, A2 and B1, B2. Because there are 4 treatments
(A1B1, A1B2, A2B1, A2B2) and assuming they are not replicated, there are four experimental
plots. Since the design is completely randomized, the allocation of plots to each of the
treatments is random. A possible arrangement is shown in the figure below:

A2B1   A1B1   A2B2   A1B2

Figure 3: Completely randomized design with two factors without replication.


The model for this experiment is as follows:
Yij = μ + αi + βj + εij,   with i=1,2; j=1,2
where Yij is the response at the i-th level of factor A and j-th level of factor B, μ represents
the general mean, αi the effect produced by the i-th level of factor A, βj the effect of the
j-th level of factor B, and εij is the random error associated with the ij-th observation. The
εij values are usually assumed normal and independent, with zero expectation and common
variance σ². When factorial experiments have no replications, the analyst must assume that the
factors do not interact in order to estimate the experimental error variance. If this assumption
is not fulfilled, the experiment is poorly designed and the conclusions of the analysis may be
completely wrong, since the interaction will be confounded with the experimental error.
Example 14: In a trial comparing the effect of water stress and salinity on the germination of
Atriplex cordobensis, seed lots were subjected to four levels of water potential (0, -0.5, -1.0
and -1.5 MPa), obtained by adding to the medium one of two osmolytes: polyethylene glycol (PEG)
or sodium chloride (NaCl). The experiment was conducted under a completely randomized design
without replication. The results are in the Factorial1 file.
The data file contains three columns: one identifying treatment factor A (water stress, water
potential), another identifying treatment factor B (salt stress, osmolyte) and another for the
observed response (germination). For this analysis, select the menu STATISTICS, ANALYSIS OF
VARIANCE. If, in the Analysis of variance variable selector window, "Factor A" and "Factor B"
were declared as Class variables and "Germination%" as Dependent variable, the next Analysis
of variance window will indicate that the variables "Factor A" and "Factor B" have been
selected as classification variables and will display them in the subwindow Specification of
model terms. As there is more than one classification factor, the Add interactions button
appears automatically. In this case, since there are no repetitions, the interaction cannot be
evaluated and therefore this button should not be activated. Pressing Go (without adding
interactions) opens a window containing the following results:
Table 25: Analysis of variance output for a two factor arrangement of treatments. File Factorial1.
Analysis of variance

Variable       N   R²     Adj R²   CV
Germination%   8   1.00   0.99     5.43

Analysis of variance table (Partial SS)

S.V.        SS        df    MS        F        p-value
Model     6568.50      4   1642.13   182.46    0.0007
Factor A  6518.50      3   2172.83   241.43    0.0004
Factor B    50.00      1     50.00     5.56    0.0997
Error       27.00      3      9.00
Total     6595.50      7

The p-value for factor A, p=0.0004, lower than the nominal significance level of the test
(α=0.05), implies that, in the range studied, the effect of this factor on mean germination is
statistically different from zero. This is not the case for factor B, since p=0.0997 is greater
than the chosen significance level.
Verification of the assumptions made about the error term and comparison of treatment means
usually accompany this type of output. Multiple comparison tests of means are needed for factor
A, because the value p=0.0004 only rejects the hypothesis of equal means among the 4 levels of
water potential but does not indicate which levels differ (see Assumptions of ANOVA, Multiple
comparisons and Contrasts).

It may happen that the difference in response between two levels of one factor is not the same
at different levels of the other factors; when this occurs it is said that there is
"interaction" between the factors. If the experimenter believes or suspects that the response
to two or more factors cannot be explained as the sum of the individual effects of those
factors, then the model for the factorial experiment should include interaction terms to
account for this fact. The inclusion of interaction terms in the model implies the need for
replications of each treatment, because otherwise it is not possible to estimate the additional
parameters (interactions). When the experiment has two factors there are only first-order
interactions; with three factors there are first- and second-order interactions, and so on for
factorial structures involving higher-order interactions.
The model for a two-factor experiment with interaction is an extension of the two-factor model
described above, except that it includes an additional set of parameters, known as interaction
parameters:
Yijk = μ + αi + βj + (αβ)ij + εijk,   with i=1,2; j=1,2; k=1,...,nij
where Yijk represents the response of the k-th repetition at the i-th level of factor A and
j-th level of factor B, μ represents the general mean, αi the effect produced by the i-th level
of factor A, βj the effect of the j-th level of factor B, (αβ)ij the additional effect
(interaction) for the combination of level i of factor A with level j of factor B, and εijk is
the random error associated with the ijk-th observation.
The εijk terms are usually assumed independent and normally distributed with zero expectation
and common variance σ². Note that the subscript k runs from 1 to nij, i.e. the number of
repetitions per treatment may differ.
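As an illustration of this model (outside InfoStat), the following minimal Python sketch with
statsmodels fits a two-factor model with interaction on a small synthetic, balanced data set;
the column names A, B and Y and the data are hypothetical.

# Two-factor model with interaction (hypothetical columns and synthetic data)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = [(a, b) for a in ["a1", "a2"] for b in ["b1", "b2"] for _ in range(3)]  # 3 replicates per cell
df = pd.DataFrame(rows, columns=["A", "B"])
df["Y"] = 20 + rng.normal(0, 1, len(df))               # synthetic response, for illustration only

model = smf.ols("Y ~ C(A) * C(B)", data=df).fit()      # main effects plus the (alpha*beta)_ij interaction
print(sm.stats.anova_lm(model, typ=2))                 # with balanced data, Type I, II and III SS coincide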
Example 4: In a study of the forage potential of Atriplex cordobensis, a shrub that grows in
depressions of the arid Chaco of Argentina, the protein concentration was evaluated in leaves
harvested in winter and summer from male and female plants. For every combination of sex and
season, three determinations of protein content, measured as a percentage of dry weight, were
made. The results are in the Factorial2 file.
The data file contains three columns: one identifying factor A (sex), another identifying
factor B (season) and another for the observed response (protein concentration). For this
analysis, select the menu STATISTICS, ANALYSIS OF VARIANCE. If, in the Analysis of variance
variable selector window, "Factor A" and "Factor B" were declared as Class variables and
"Protein%" as Dependent variable, the next Analysis of variance window indicates that the
variables "Factor A" and "Factor B" have been selected as classification variables and displays
them in the subwindow Specification of model terms. To include the interaction between Factor
A and Factor B, Add interactions has to be activated. Go opens the Results window containing
the information shown in the following table.
Note: Activating Add interactions automatically adds, as model terms, all possible interactions
between the classification variables. To eliminate any term included in that list that you do
not want in the model equation, select it and press the Delete key. To avoid adding unwanted
interaction terms to the model, you can select in the subwindow Class variables only those
terms whose interactions you want to add and then press the Add interactions button.

Table 26: Analysis of variance output for a two factor arrangement of treatments. File Factorial2.
Analysis of variance

Variable   N    R²     Adj R²   CV
Protein%   12   0.93   0.91     6.30

Analysis of variance table (Partial SS)

S.V.                SS       df    MS       F        p-value
Model              198.00     3    66.00    37.71    <0.0001
Factor A             3.00     1     3.00     1.71     0.2268
Factor B             3.00     1     3.00     1.71     0.2268
Factor A*Factor B  192.00     1   192.00   109.71    <0.0001
Error               14.00     8     1.75
Total              212.00    11

As can be seen, the p-value associated with the interaction is highly significant, indicating
that the factors studied do not act independently. For this reason no conclusions about the
main effects should be drawn from this table, because the presence of interaction may be
affecting the mean differences. In this case the levels of factor A should be compared within
the treatments that received the same level of factor B, or vice versa (compare the levels of
B for each level of A separately).
To draw conclusions about the effect of a factor in the presence of a significant interaction,
the experimenter should examine the levels of that factor while holding the levels of the other
factor fixed. The following figure, obtained with InfoStat, shows the mean values of the four
treatments and allows an easy interpretation of the result shown in the analysis of variance
table.
Figure 4: Mean ± standard error of protein concentration (Protein%) in Atriplex cordobensis
leaves, by harvest season (winter, summer), for male and female plants.
If some combination of factor levels is absent (missing cells), InfoStat automatically displays
Type I (sequential) sums of squares and reports in the results window that the design is
cell-unbalanced. These sums of squares are obtained sequentially for each term of the model,
provided there is a subset of the data that allows the contrasts of interest (main effects and
interaction) to be estimated, at least for the existing data.

Design with nested structure of treatments


In some experiments with more than one treatment factor, the levels of one factor (e.g. factor
B) do not represent the same thing at different levels of another factor (e.g. factor A). Such
an arrangement is known as a nested design, and factor B is said to be nested within the levels
of factor A if for each level of A there is a particular set of levels of B. If B is nested in
A, it makes no sense to study the interaction between A and B, because the levels of B are
evaluated within each level of A. The ANOVA table provides the sum of squares due to factor A
and the sum of squares due to factor B within A (InfoStat recognizes this condition when the
term is declared as A>B). The sum of squares of the term A>B equals the sum of the sums of
squares of factor B and of the A*B interaction.
Example 16: A paper company buys pulp batches from three logging companies. The study variable
is the g/m² of paper obtained, in relation to the company and the batch. For each company, four
batches of raw material were drawn at random and three determinations were made per batch. The
factor "lot" can be regarded as nested within the factor "company", i.e., batches from different
companies are not the same. In this example the aim is to know whether there are differences
between companies and whether the lots within companies are homogeneous with respect to the
response. The data are in the Nested file.
To obtain the ANOVA table with InfoStat, proceed as follows: choose the menu STATISTICS,
submenu ANALYSIS OF VARIANCE, and in the Analysis of variance variable selector window specify
the Dependent variable, which in this example is "Value", and the Class variables "Company" and
"Lot". When Go is pressed, the Analysis of variance window containing the Model flap is enabled
and the two classification variables appear listed in the subwindow Specification of model
terms. To declare nested factors, the names of the factors involved must be written in this
subwindow separated by the ">" (greater than) character; in this case "Company>Lot", indicating
that Lot is nested within the factor Company. The factor "Company", which is at the highest
hierarchical level, is stated separately, so in this example the window should contain the
terms "Company" and "Company>Lot".
In this design, as in others, the error terms to be used to assess the significance of the
different sources of variation (model factors) depend on the fixed or random nature of the
model effects. By default, InfoStat calculates the F statistics of the reported factors using
the experimental error term, which is valid if all model effects are fixed.
If, for example, Company is a fixed effect and Lot is a random effect, the error term for the
factor "Company" is "Company>Lot". To state this in the model, in the window Specification of
model terms add to the factor Company the character "\" (backslash) followed by the
corresponding error term, in this case Company>Lot. The window will then display the terms
Company\Company>Lot and Company>Lot.
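The same idea can be reproduced outside InfoStat. Below is a minimal Python sketch (statsmodels
and scipy), in which the term C(Company):C(Lot) plays the role of the nested term Company>Lot
and the Company mean square is tested against it, as the "\" syntax does. The column names and
synthetic data are hypothetical.

# Nested model sketch: Lot nested within Company; Company tested against the nested mean square
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(2)
rows = [(c, l) for c in ["C1", "C2", "C3"] for l in ["L1", "L2", "L3", "L4"] for _ in range(3)]
df = pd.DataFrame(rows, columns=["Company", "Lot"])
df["Value"] = 60 + rng.normal(0, 1.5, len(df))                        # synthetic g/m2 values, illustration only

model = smf.ols("Value ~ C(Company) + C(Company):C(Lot)", data=df).fit()
aov = sm.stats.anova_lm(model, typ=1)                                 # sequential SS; nested term gets 9 df here
f_comp = aov.loc["C(Company)", "mean_sq"] / aov.loc["C(Company):C(Lot)", "mean_sq"]
p_comp = stats.f.sf(f_comp, aov.loc["C(Company)", "df"], aov.loc["C(Company):C(Lot)", "df"])
print(round(f_comp, 2), round(p_comp, 4))                             # analogue of Company\Company>Lot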
Table 27: Analysis of variance output for a two nested factors arrangement. File Nested.
Analysis of variance

Variable   N    R²     Adj R²   CV
Value      36   0.57   0.38     2.16

Analysis of variance table (Partial SS)

S.V.           SS       df    MS      F       p-value   (Error)
Model          84.97    11    7.72    2.93    0.0135
Company        15.06     2    7.53    0.97    0.4158    (Company>Lot)
Company>Lot    69.92     9    7.77    2.94    0.0167
Error          63.33    24    2.64
Total         148.31    35

In this example we conclude that the factor Company has no significant effect on the g/m² of
paper (p=0.4158) and that the variance among lots within at least one company is different from
zero (p=0.0167).

Split plot design


This type of design is often used in experiments with more than one treatment factor in which
there are restrictions that prevent the complete randomization of the treatments (combinations
of factor levels) to the experimental units. It is useful when one of the treatment factors
needs to be evaluated on large plots or experimental units while the other treatment factor can
be evaluated on smaller units (subunits).
The design is called split plot because, usually, one of the treatment factors (Factor A) is
assigned to larger experimental units (main plots) and, within each level of this factor, i.e.
within each main plot, "subplots" or smaller plots are identified to which the second treatment
factor (Factor B) is randomly assigned. At the same time there may be other blocking
(non-treatment) factors defining the plot structure. Thus, among others, one may have a
split-plot design with a completely randomized plot structure or a split-plot design with a
plot structure in blocks, i.e. in each of the b blocks of the experiment main plots and
subplots are identified. InfoStat can directly handle any of these experimental structures.
For example, consider an experiment to test differences among four levels of a factor A using a
randomized block design, and two levels of a second factor B that are assigned to subplots
within the main plots associated with A. Randomization is carried out in two stages: first the
levels of factor A are randomized to the main plots within each block, then the levels of
factor B are randomized to the subplots of each main plot. The randomization restrictions
imposed in this kind of experiment require more than one error term. If the main plots are
arranged in complete blocks, the block*main plot interaction is the error term for the effect
assigned to the main plot (error A). If the main plots follow a completely randomized design
(main plots not arranged in blocks), the effect of replicates within main plot (main
plot>repetitions) can be taken as the error term for the main plot. The error terms to be used
depend on the plot structure and the treatment structure. If there are three factors to
evaluate, each on a different size of experimental unit, the design is a split-split-plot and
there will be three error terms: error A (for the main plots), error B (for the subplots) and
the experimental error (for the sub-subplots).
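To make the use of error A concrete, the following is a minimal Python sketch (statsmodels and
scipy, outside InfoStat) in which the main-plot factor A is tested against the Block*A mean
square; the column names Block, A, B, Y and the synthetic data are hypothetical.

# Split-plot sketch: main-plot factor A tested against the Block*A mean square (error A)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(3)
rows = [(blk, a, b) for blk in ["I", "II", "III"]
        for a in ["a1", "a2"] for b in ["b1", "b2", "b3", "b4"]]
df = pd.DataFrame(rows, columns=["Block", "A", "B"])
df["Y"] = 40 + rng.normal(0, 2, len(df))                              # synthetic response, illustration only

model = smf.ols("Y ~ C(Block) + C(A) + C(Block):C(A) + C(B) + C(A):C(B)", data=df).fit()
aov = sm.stats.anova_lm(model, typ=1)
f_a = aov.loc["C(A)", "mean_sq"] / aov.loc["C(Block):C(A)", "mean_sq"]   # F for A against error A
p_a = stats.f.sf(f_a, aov.loc["C(A)", "df"], aov.loc["C(Block):C(A)", "df"])
print(round(f_a, 2), round(p_a, 4))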
Example 17: In a study on wheat, two main plots were arranged in each of three blocks. The
levels of the irrigation factor were randomized to the main plots, and each main plot was
divided into four subplots to which 4 wheat varieties were randomized. The study variable was
yield measured in kg/plot. The factor "irrigation" (Factor A) has two levels, dry (no
irrigation) and irrigated, and for the factor "variety" (Factor B) the following varieties were
used: Buck-Charra, Las Rosas INTA, Pigue and Pro-Prop INTA. The data (courtesy of Mr. M.
Cantarero, Faculty of Agricultural Sciences, UNC) are found in the plots file.
To perform the analysis with InfoStat, proceed as follows: select the menu STATISTICS, ANALYSIS
OF VARIANCE, and in the Analysis of variance variable selector window specify the Dependent
variable, which in this example is "Performance", and the Class variables "Parc" (identifying
the main plot factor, irrigation), "Block" and "Variety". Pressing Go enables the next Analysis
of Variance window, where the indicated classification variables appear in the Model tab. You
must add to the model the interaction Parc*Block (error A, to evaluate the main plot effect)
and the interaction Parc*Variety since, despite the randomization restrictions, there is a
factorial treatment structure.

The interaction between varieties (subplot factor) and blocks can be added to test the
assumption of no interaction. Some authors suggest that, once the absence of block-subplot
interaction has been confirmed, the analysis may be presented without this term in the model in
order to increase the error degrees of freedom. With these specifications the following output
is obtained:
Table 28: Analysis of variance output for a two cross factors arrangement in a split-plot design with
replications in complete blocks. File Wheat.
Analysis of variance

Variable   N     R²     Adj R²   CV
Yield      108   0.76   0.72     5.94

Analysis of variance table (Partial SS)

S.V.                SS        df    MS       F        p-value    (Error)
Model             1515.11     14   108.22    20.71    <0.0001
Block              636.72      2   318.36    60.93    <0.0001
Tillage            533.56      2   266.78   181.21     0.0001    (Block*Tillage)
Block*Tillage        5.89      4     1.47     0.28     0.8891
Varieties          330.39      2   165.19    31.62    <0.0001
Tillage*Varieties    8.56      4     2.14     0.41     0.8015
Error              485.89     93     5.22
Total             2001.00    107

The results suggest that there is no Tillage*Varieties interaction, so the results for the main
effects can be interpreted directly.
To make comparisons and/or contrasts between the levels of the factors involved, InfoStat will
use, for each model term, the error specified in the (Error) column.
Example 18: In a cardboard strength test, preparations of base pulp were made with three
different amounts of water (50, 75 and 100 liters). Each of the preparations (main plots) was
repeated three times in random order over time. Each preparation was then divided into four
equal fractions (subplots), which were subjected to different cooking temperatures (20, 25, 30
and 35 degrees) assigned at random. The study variable was the strength of the cardboard
produced. The data are in the file Split-PlotCRD.
To perform the analysis with InfoStat, proceed as follows: select the menu STATISTICS, ANALYSIS
OF VARIANCE, and in the Analysis of variance variable selector window specify the Dependent
variable, which in this example is "Resistance", and the Class variables "Water" (identifying
the main plot factor, amount of water), "Repetition" and "Temperature". Pressing Go enables the
next Analysis of variance window, where the indicated classification variables appear in the
Model tab. The term Water>Repetition (error A, to assess the effect of water) and the
interaction Water*Temperature must be added to the model since, despite the randomization
restrictions, there is a factorial treatment structure. Below is the window with the terms of
the model proposed to analyze this example.

With these specifications the following output is obtained:


Table 29: Analysis of variance output for a two cross factors arrangement in a split-plot with
replications in completely randomized design. File Split-plotCRD.
Analysis of variance

Variable     N    R²     Adj R²   CV
Resistance   36   0.86   0.72     4.73

Analysis of variance table (Partial SS)

S.V.                 SS        df    MS       F       p-value   (Error)
Model               933.84     17    54.93    6.27    0.0002
Water               522.75      2   261.37   25.50    0.0012    (Water>Repetition)
Water>Repetition     61.49      6    10.25    1.17    0.3649
Temperature         248.70      3    82.90    9.46    0.0006
Water*Temperature   100.90      6    16.82    1.92    0.1324
Error               157.66     18     8.76
Total              1091.49     35

The results suggest that there is no Water*Temperature interaction (p=0.1324), so the main
effects can be interpreted directly: there is a water effect (p=0.0012) and a temperature
effect (p=0.0006).

Split-split-plot design
Main plots replicated in a CRB design
The data in the file SplitSplitPlot come from a randomized complete block design with 3
replications (Block). Each block was divided into three main plots. Three tillage methods
(Tillage: Zero, Minimum and Conventional) were randomly assigned to the main plots (MP). After
tillage, the main plots were divided into three subplots (SP), to which 3 maize varieties
(Variety: levels v1, v2 and v3) were randomly assigned. Finally, each subplot was divided into
4 sub-subplots (SSP), to which 4 types of fertilizer (Fertilizer: levels A, B, C and D) were
randomized. The variable evaluated was corn yield (qq/ha). To perform the analysis, yield must
be declared as the dependent variable, and Block, Tillage, Variety and Fertilizer as
classification variables.
The split-split-plot design involves three levels of randomization, so the analysis must take
into account three different errors: one for the main plot, one for the subplot and one for the
sub-subplot. In InfoStat only the errors for the MP and the SP need to be declared, as the
third error is declared by default. The error for the main plot is the interaction between the
block and the factor assigned to the MP, in this example the tillage method. The error for the
SP is given by the interaction between the block and the subplot factor, in this example
Block*Variety, plus the triple interaction of Block, MP and SP, in this case
Block*Tillage*Variety. In InfoStat this sum can be replaced by Tillage>Variety*Block
(Block*Variety + Block*Tillage*Variety = Tillage>Variety*Block).
Then, in the Model flap of the ANOVA menu, you should write:

In this window a blank line has been left between the model terms for the MP, SP and SSP,
respectively, for easier viewing. On the Comparisons flap the test requested was Duncan.
Pressing the Go button produces the following result:
Table 30: Analysis of variance output for a split-split-plot design with replications in complete
blocks. File Split-split-plot.
Analysis of variance

Variable   N     R²     Adj R²   CV
Yield      108   0.97   0.95     2.63

Analysis of variance table (Partial SS)

S.V.                           SS        df    MS       F        p-value    (Error)
Model                        1945.83     53    36.71    35.94    <0.0001
Block                         636.72      2   318.36   216.25     0.0001    (Block*Tillage)
Tillage                       533.56      2   266.78   181.21     0.0001    (Block*Tillage)
Block*Tillage                   5.89      4     1.47     1.44     0.2331
Varieties                     330.39      2   165.19   223.01    <0.0001    (Tillage>Varieties*Block)
Varieties*Tillage               8.56      4     2.14     2.89     0.0690    (Tillage>Varieties*Block)
Tillage>Varieties*Block         8.89     12     0.74     0.73     0.7208
Fertilizer                    400.78      3   133.59   130.77    <0.0001
Varieties*Fertilizer            3.39      6     0.56     0.55     0.7656
Tillage*Fertilizer              7.56      6     1.26     1.23     0.3043
Varieties*Tillage*Fertilizer   10.11     12     0.84     0.82     0.6247
Error                          55.17     54     1.02
Total                        2001.00    107

Test: Fisher LSD   Alpha:=0.05   LSD:=0.79403
Error: 1.4722   df: 4

Tillage        Means   n    S.E.
Conventional   36.33   36   0.17   A
Minimum        37.61   36   0.17       B
Zero           41.56   36   0.17           C
Different letters indicate significant difference between location parameters (p<= 0.05)

Test: Fisher LSD   Alpha:=0.05   LSD:=0.44199
Error: 0.7407   df: 12

Varieties   Means   n    S.E.
v3          36.75   36   0.17   A
v1          37.86   36   0.17       B
v2          40.89   36   0.17           C
Different letters indicate significant difference between location parameters (p<= 0.05)

Test: Fisher LSD   Alpha:=0.05   LSD:=0.55152
Error: 1.0216   df: 54

Fertilizer   Means   n    S.E.
A            36.52   27   0.19   A
B            36.63   27   0.19   A
C            40.41   27   0.19       B
D            40.44   27   0.19       B
Different letters indicate significant difference between location parameters (p<= 0.05)

This model contains 3 factors of interest, so the first things to evaluate are the triple
interaction and the double interactions. In this example no interaction is significant, so it
is possible to evaluate the effects of the factors separately. There are differences among
tillage methods (p=0.0001), zero tillage being the recommended one. There are also differences
among varieties (p<0.0001), v2 being the recommended one. Finally, there are differences among
fertilizers (p<0.0001), and either C or D can be recommended.

Main plots in a completely randomized design


If the main plots are replicated in a completely randomized design, the only change with
respect to the randomized complete block model is the error term for the main plot (MP). In
this case, you must declare the repetitions (Rep) and include in the Model flap the main plot
factor and the repetitions nested within the MP. For the previous example (SplitSplitPlot),
assuming that the repetitions of the MP were completely randomized rather than arranged in
blocks, the Model window should look as follows:


Multiple comparisons
Usually, when the effects of a factor in the ANOVA are found to be non-zero, a multiple
comparison test of means is applied. InfoStat provides the sample means of each of the
distributions being compared. The user can specify whether to compare the means of all
treatments and, in designs with a factorial structure of treatments, the means of the levels of
each of the factors involved. To analyze the pairwise differences between the means of the
distributions being compared, a wide variety of a posteriori or multiple comparison tests can
be performed (Hsu, 1996; Hsu and Nelson, 1998).
The Comparisons subwindow, available in the ANOVA window, allows you to request hierarchical
multiple comparison procedures (based on hierarchical clustering algorithms) as well as
traditional ones (Gonzalez, 2001). Traditional methods usually have a lower type I error rate
than cluster-based procedures when working with experiments that do not have good control of
the precision used for the comparison of means. With a large number of treatment means,
traditional procedures can produce output that is difficult to interpret because the same mean
can belong to more than one group of means (lack of transitivity). By contrast, hierarchical
methods for mean comparison produce mutually exclusive groups (a partition of the set of
treatment means). Di Rienzo et al. (2001) and Gonzalez (2001) carried out a simultaneous
comparison of the hierarchical and non-hierarchical multiple comparison methods that InfoStat
implements.
For any selected procedure, InfoStat allows the nominal significance level used for the test to
be defined. In addition, you can choose how the results of the multiple comparisons are
presented (as a list or in matrix form). If a matrix presentation is requested, InfoStat
presents a matrix whose elements below the diagonal are the differences between means, while
above the diagonal the symbol "*" indicates the pairs of means that differ at the chosen
significance level. If a list presentation is requested, the comparisons are displayed as a
list in which different letters indicate significant differences between the means being
compared. The user can also enter a value for the mean square error estimate (and its degrees
of freedom) to be used in the comparison of means. When the corresponding box is activated,
InfoStat does not use the error terms of the last ANOVA, as it does by default.
Below is a brief description of the procedures available in InfoStat.

Traditional methods
Fisher LSD test

Compares the observed difference between each pair of sample means with the critical value of
the T test for two independent samples. With balanced data, this test is equivalent to Fisher's
least significant difference for any comparison of main effect means. The test does not adjust
the significance level for simultaneous inference, so the experimentwise error rate can be
greater than the nominal level and increases with the number of treatments evaluated.
Bonferroni test

It is based on a T test whose level has been adjusted according to the Bonferroni inequality.
It allows controlling the type I error for the simultaneous inference of c=k(k-1)/2 pairwise
contrasts, each based on a Student T statistic, where k is the number of treatments compared.
The significance level of each individual contrast is adjusted according to the number of
comparisons: each contrast is made at a significance level of α/c.
Tukey test

It is based on Tukey's statistic, which uses as the critical value for identifying significant
differences a least significant amount based on the corresponding quantile of the Studentized
range distribution. When the sample sizes are equal, this test controls the experimentwise
error rate under complete or partial null hypotheses. The test is more conservative (lower type
I error) than the Newman-Keuls or Duncan tests, and consequently may have less power than they
do. When sample sizes are unequal, InfoStat implements the Tukey-Cramer modification (Miller,
1981).
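As an external cross-check (outside InfoStat), pairwise Tukey comparisons can also be obtained
with statsmodels; a small sketch with hypothetical group labels and synthetic data:

# Tukey pairwise comparisons sketch (statsmodels); groups and data are hypothetical
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)
groups = np.repeat(["T1", "T2", "T3"], 10)                          # treatment labels
y = np.concatenate([rng.normal(m, 2, 10) for m in (50, 55, 51)])    # synthetic responses
print(pairwise_tukeyhsd(y, groups, alpha=0.05))                     # pairs flagged reject=True differ at level alpha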
Duncan test

It is also known as the multiple range test and belongs to the family of multiple-stage tests.
These tests first study the homogeneity of all k means at a significance level αk. If the
hypothesis of homogeneity of the k means is rejected, homogeneity is tested in each subset of
k-1 means using a significance level αk-1; otherwise the procedure stops. The procedure is
repeated until a stage is reached at which the subset of means involved is found to be
homogeneous. In general, the significance level at the i-th stage is αi = 1-(1-α)^(i-1). The
Duncan method controls the per-comparison error rate so that it does not exceed the nominal
value, but the experimentwise error rate can increase. This increase may lead to a decrease in
the type II error, which is why some authors claim that this test is more powerful.
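The stage-wise levels implied by this formula are easy to evaluate; a small sketch in Python
for α=0.05:

# Stage-wise significance levels alpha_i = 1 - (1 - alpha)^(i - 1) used by Duncan-type range tests
alpha = 0.05
for i in range(2, 7):                               # subsets of 2..6 means
    print(i, round(1 - (1 - alpha) ** (i - 1), 4))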
Student-Newman-Keuls test (S.N.K)

It is also a multiple range test (see Duncan). The only difference from the Duncan test is that
the SNK method uses the same significance level at every stage. The Newman-Keuls test controls
the experimentwise error rate only when the null hypothesis is complete, i.e. when all means
are equal. However, the experimentwise error rate approaches one as the number of treatments
increases, and there is then a greater probability that the null hypothesis is true only for a
subset of means (partial null hypotheses).

Procedures based on cluster


Di Rienzo, Guzmán and Casanoves test (DGC)

This mean comparison procedure (Di Rienzo et al., 2002) uses the multivariate technique of
cluster analysis (average linkage or UPGMA) on a distance matrix D = {dij}, with
dij = |X̄i - X̄j| / √(S²/n), obtained from the sample means. As a result of the cluster analysis
a binary tree is obtained in which the hierarchical sequence of cluster formation can be
observed. If Q designates the distance between the origin and the root node of the tree (the
node at which all means are joined), InfoStat uses the distribution of Q under the hypothesis:
H0: μ1 = μ2 = ... = μk
to build a test with significance level α. The means (or groups of means) joined at nodes that
lie above Q can be considered statistically different at the significance level α. The method
assumes an equal number of repetitions per treatment; otherwise the implemented algorithm uses
the harmonic mean of the numbers of repetitions. This test controls the per-comparison type I
error rate well while maintaining acceptable power in well-conducted experiments (low CV for
the mean differences), and its overall performance improves as the number of means to compare
increases.
Jollife test

This test (Jollife, 1975) is also based on cutting a dendrogram that shows similarity
relationships between treatment means. The test consists of applying a cluster analysis, based
on the single linkage algorithm, to the matrix of p-values of the Studentized range
distribution applied to the differences between pairs of means. Means that join at a level
greater than 1-α are declared statistically different. This test is the most conservative among
the hierarchical procedures.
Scott and Knott test

Scott and Knott (1974) were pioneers in using a partition criterion for clusters within the
framework of a mean comparison procedure. The test uses a divisive method, and the threshold
criterion is based on the asymptotic distribution of a statistic that resembles an F test.
Because successive partitions are performed on the same data set, the actual significance level
may differ from the nominal one, so it should be taken as an index.
Bautista et al. (BSS) Test

Bautista et al. (1997) proposed a recursive algorithm based on the combination of a clustering
technique and a hierarchical analysis of variance. Like the Scott and Knott test, it has the
problem that the significance level of the comparisons should be taken as an index; however, in
simulation studies (Di Rienzo et al., 2002) its performance was satisfactory.

Contrasts
The Contrasts subwindow, available in the main ANOVA window, allows testing the significance of
contrasts, i.e. hypotheses about linear combinations of the model parameters.
A contrast is defined as a linear combination of the parameters of the model (Montgomery,
1991). In the analysis of variance, contrasts usually take the form a1M1 + a2M2 + ... + akMk,
where the ai coefficients are known constants, at least two of them non-zero and summing to
zero, and Mi is the i-th population mean. Contrasts allow pre-planned comparisons between means
to be added to the analysis of variance. For example, with three means M1, M2 and M3, the
contrast 1 -1 0 compares M1 with M2, and the contrast 2 -1 -1 is equivalent to comparing M1
with the average of M2 and M3.
If more than one contrast is to be posed and the comparisons are to be independent of each
other, the contrasts must be orthogonal. Two contrasts are orthogonal if the sum of the
products of their coefficients is zero. Thus, for C1 = a1M1 + a2M2 + ... + akMk and
C2 = b1M1 + b2M2 + ... + bkMk, C1 and C2 are orthogonal if a1b1 + a2b2 + ... + akbk = 0. Three
or more contrasts are orthogonal if all pairs of contrasts are orthogonal. InfoStat allows the
orthogonality of the specified contrasts to be checked by activating the option Orthogonality
Control.
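This condition is easy to verify numerically; the following small sketch (plain Python/NumPy,
with hypothetical coefficient vectors) checks that two contrasts are valid and mutually
orthogonal.

# Checking contrast validity and orthogonality (hypothetical coefficient vectors)
import numpy as np

c1 = np.array([1, -1,  0])        # compares M1 with M2
c2 = np.array([1,  1, -2])        # compares the average of M1 and M2 with M3
print(c1.sum(), c2.sum())         # both sums are zero, so each is a valid contrast
print(int(c1 @ c2))               # dot product is zero, so the pair is orthogonal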
Example 19: In a study to evaluate wheat yield, five treatments were tested: the combinations
of N and K at high and low doses (NhKh, NhKl, NlKh, NlKl) plus a treatment without fertilizer
(witness). Note that the five treatments can be seen as a 2x2 factorial arrangement with the
addition of the witness. The data are in the file Contrast1.
In the next window a set of orthogonal contrasts of interest can be seen. The first contrast
compares the witness against the average of the other treatments. The second compares the
average of the high-N treatments with the average of the low-N treatments. The third and fourth
contrasts compare the K factor in the presence of high and low N, respectively.

To better identify the contrasts in the output, InfoStat allows each contrast to be named. This
is done by writing the name of the contrast in the Matrix of contrasts window followed by ":"
and then the contrast coefficients of interest. If no name is specified, InfoStat calls them
Contrast 1, Contrast 2, and so on, according to the order in which they were specified.
The following table shows the corresponding output.
Table 31: Analysis of variance output with contrasts. File Contrast1.
Analysis of variance

Variable   N    R²      Adj R²   CV
Yield      15   0.945   0.923    4.191

Analysis of variance table (Partial SS)

S.V.         SS        df    MS       F        p-value
Model      333.808      4   83.452   43.078    <0.0001
Treatment  333.808      4   83.452   43.078    <0.0001
Error       19.372     10    1.937
Total      353.180     14

Contrasts
Treatment    SS        df    MS        F        p-value
Contrast1   186.920     1   186.920   96.489    <0.0001
Contrast2   137.458     1   137.458   70.956    <0.0001
Contrast3     9.414     1     9.414    4.860     0.0520
Contrast4     0.016     1     0.016    0.008     0.9300
Total       333.808     4    83.452   43.078    <0.0001

Contrasts coefficients
Treatment   Ct.1     Ct.2     Ct.3     Ct.4
NhKh         1.000    1.000    1.000    0.000
NhKl         1.000    1.000   -1.000    0.000
NlKh         1.000   -1.000    0.000    1.000
NlKl         1.000   -1.000    0.000   -1.000
Witness     -4.000    0.000    0.000    0.000

As can be seen, the first two null hypotheses are rejected: the witness differs significantly
from the rest (p<0.0001) and the high level of N differs from the low level (p<0.0001).
Contrast 3 is marginally significant (p=0.0520), and contrast 4 is not significant (p=0.9300),
which implies that there is no difference between the low and high doses of K at the low level
of N.
Through the Contrasts subwindow you can also specify tests for polynomial trends when the
levels of the treatment factor are quantitative and equidistant. Rather than comparing all the
factor levels in pairs, in these cases it is more informative to investigate whether there are
trends in the response as the level of the treatment factor increases. The trend may correspond
to a linear, quadratic, cubic, etc. increase (or decrease). The contrast coefficients used must
correspond to the coefficients of the polynomials that model the hypothesized type of trend.
With a levels of the treatment factor, it is possible to postulate a-1 orthogonal polynomials
of order 1, ..., a-1. Orthogonal polynomial coefficients for various numbers of treatments can
be found in Montgomery (1991).
For example, if a factor has 5 equally spaced levels, the contrast sums of squares for the
linear, quadratic, cubic and fourth-order effects decompose the treatment sum of squares into
four contrasts with one degree of freedom each. The contrast coefficients in this example would
be:
Table 32: Coefficients for orthogonal polynomial contrasts.

Level   Linear   Quadratic   3rd order   4th order
1        -2        2           -1           1
2        -1       -1            2          -4
3         0       -2            0           6
4         1       -1           -2          -4
5         2        2            1           1
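The coefficients in Table 32 (and those tabulated in Montgomery, 1991) can also be generated numerically. The following Python sketch, a hypothetical illustration rather than InfoStat's own algorithm, orthogonalizes the powers of the level indices with a QR decomposition and rescales each column to integer form; it reproduces the table above up to sign.

import numpy as np

x = np.arange(1, 6, dtype=float)                 # 5 equally spaced levels
V = np.vander(x, 5, increasing=True)             # columns 1, x, x^2, x^3, x^4
Q, _ = np.linalg.qr(V)                           # Gram-Schmidt orthogonalization
P = Q[:, 1:]                                     # drop the constant column
for j in range(P.shape[1]):                      # rescale so the smallest nonzero
    col = P[:, j]                                # coefficient becomes 1 in absolute value
    P[:, j] = col / np.min(np.abs(col[np.abs(col) > 1e-9]))
print(np.round(P, 2))                            # linear, quadratic, cubic, 4th order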

Example 20: To study the effect of a new chemical formulation for the control of an insect, an experiment was carried out that tested the new formulation (new) and the standard formulation (standard), each at three doses of active ingredient (15, 20 and 25 mg/l). The response variable is the average number of dead insects per plant. The data are in the file Contrast2.
The ANOVA table for this example follows:
Table 33: Analysis of variance output for a two crossed factors arrangement. File Contrast2.
Analysis of variance

Variable   N    R²     Adj R²   CV
Response   12   0.88   0.78     22.35

Analysis of variance table (Partial SS)

S.V.        SS       df   MS      F       p-value
Model       160.25    5   32.05    8.99   0.0093
Drug         33.81    1   33.81    9.49   0.0217
Dose         29.14    2   14.57    4.09   0.0758
Drug*Dose    97.30    2   48.65   13.65   0.0058
Error        21.38    6    3.56
Total       181.63   11

As shown, there is a drug-by-dose interaction (p=0.0058); therefore, inferences cannot be made about the drug and dose main effects. For this reason, contrasts were set up by partitioning the interaction term, i.e., studying one factor within the levels of the other. Because the linear and quadratic dose trends are of interest for each formulation, orthogonal polynomial contrasts were performed for doses within each formulation. The contrasts and coefficients are as follows:
Table 34: Orthogonal polynomial contrast output. File Contrast2.
Contrasts
Drug*Dose    SS       df   MS      F       p-value
Contrast1    63.85     1   63.85   17.92   0.0055
Contrast2     1.62     1    1.62    0.45   0.5257
Contrast3     9.78     1    9.78    2.74   0.1487
Contrast4    51.20     1   51.20   14.37   0.0091
Contrast5    33.81     1   33.81    9.49   0.0217
Total       160.25     5   32.05    8.99   0.0093

Contrast coefficients
Drug*Dose          Cont. 1   Cont. 2   Cont. 3   Cont. 4   Cont. 5
Standard: 15.00     -1.00      1.00      0.00      0.00      1.00
Standard: 20.00      0.00     -2.00      0.00      0.00      1.00
Standard: 25.00      1.00      1.00      0.00      0.00      1.00
New: 15.00           0.00      0.00     -1.00      1.00     -1.00
New: 20.00           0.00      0.00      0.00     -2.00     -1.00
New: 25.00           0.00      0.00      1.00      1.00     -1.00

Contrast 1 tests a linear trend within the standard formulation and is significant (p=0.0055), whereas contrast 2 tests a quadratic trend and is not significant (p=0.5257). Contrast 3 tests a linear trend for the doses of the new formulation and is not significant (p=0.1487). Contrast 4, for the quadratic trend within the new formulation, is significant (p=0.0091). Contrast 5 tests whether there are differences between the new and the standard formulations and is significant (p=0.0217); this result coincides with the result obtained in the ANOVA for the drug factor. The difference in the polynomial trends of the response across doses within each drug explains the drug-by-dose interaction detected in the original table.

ANOVA assumptions
The analysis of variance is sensitive to the statistical properties of the random error terms of the linear model. The traditional assumptions of ANOVA imply independent errors, normally distributed with homogeneous variances for all observations. In addition, designs involving plots arranged in blocks assume block-treatment additivity, i.e., the blocks have an additive effect on all treatments and do not interact with them.
The verification of the underlying assumptions is done in practice through the predictors of the random error terms, which are the residuals associated with each observation. Under the ANOVA, InfoStat allows obtaining residuals, predicted values, studentized residuals and absolute residual values. By selecting one or more of these options, columns containing the corresponding values are added to the data file. The residual associated with the ij-th observation (denoted e_ij) is the difference between the observed value and the value predicted by the model for the response in the ij-th experimental unit. From the residuals and their transformations, you can verify compliance with the assumptions of normality, homogeneity of variances and block-treatment additivity by means of graphical evidence and/or formal model adequacy tests. InfoStat provides residuals on the scale of the observed variable (e_ij = difference between observed and predicted values). By activating the field Residuals, InfoStat creates a column called RDUO_"variable name" in the active data table. You can also activate the field Stud. res. to obtain studentized residuals, defined as:

RE_i = e_i / √( S² (1 - h_ii) )

where S² is the mean square error and h_ii is the leverage of the i-th observation (Hocking, 1996). In this case InfoStat creates a column called RE_"variable name" in the data table.
The Overwrite box can be activated when more than one ANOVA is run on the same data set and only the last set of requested values is to be kept. If this box is not checked, InfoStat generates as many columns as analyses of variance are requested, numbering each new variable added to the table consecutively.


Usually, in practice, the assumptions of ANOVA are not met exactly. If there is evidence of a serious violation of the assumptions, the model and/or the analysis strategy may not be appropriate.
Below are some strategies that can be carried out in InfoStat to check whether the ANOVA assumptions are met. The examples use the data in the Block file.
Normality: selecting the residuals as the analysis variable, one of the most widely used techniques is to build a normal Q-Q plot. This produces a scatter plot of the observed residuals versus the theoretical quantiles of a normal distribution. If the residuals are normal and there are no other defects in the model, the points will align on a straight line at 45°.

Figure 5: Q-Q plot obtained from a model with normally distributed errors. File Block.
Having run an ANOVA and saved the residuals, select the Q-Q plot (normal) option from the GRAPHS menu on the InfoStat toolbar and use the model residuals as the variable. In the figure above, the Show line Y=X option adds the reference line for these residuals.
InfoStat also allows hypothesis testing for normality through the menu STATISTICS > ONE-SAMPLE INFERENCE > NORMALITY TEST (MODIFIED SHAPIRO-WILKS). There, selecting the residuals as the analysis variable yields the W* statistic of the Shapiro-Wilks test modified by Mahibbur and Govindarajulu (1997).
Table 35: Normality test. File Block.
Shapiro-Wilks (modified)

Variable    n    Mean   S.D.     W*     p (one tail)
RES_Yield   20   0.00   127.67   0.96   0.7824

The hypotheses tested are:
H0: the residuals have a normal distribution versus H1: the residuals do not have a normal distribution.
In this case there is no evidence to reject the assumption of a normal distribution (p=0.7824).
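The same kind of check can be reproduced outside InfoStat. The sketch below is a hypothetical illustration using scipy's classical Shapiro-Wilk test (not the Mahibbur and Govindarajulu modification used by InfoStat, so the p-value will not coincide exactly) applied to a simulated stand-in for the saved RES_Yield column.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
res = rng.normal(0, 127.67, size=20)   # hypothetical stand-in for RES_Yield

W, p = stats.shapiro(res)              # classical (unmodified) Shapiro-Wilk test
print(f"W = {W:.3f}, p = {p:.4f}")     # a large p-value supports normality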

Homogeneity of variances: when the errors are homoscedastic, a scatter plot of residuals versus predicted values should show a cloud of points without any pattern (random scatter). If the chart shows structure, there is reason to suspect that the assumption does not hold. A typical pattern indicating lack of homogeneity of variances is shown in Figure 6. For the example worked in this section, the residuals-versus-predicted plot shown in Figure 7 exhibits no trend indicating a violation of the assumption of homogeneity of variances.

-0.13

-1.44

-136.93

-2.76
28.61

4.20

39.19

33.90

44.48

PRED_Yield

Figure 6: Scatter plot of residuals against


predicted showing heteroscedasticity of
variances

-278.06
1791.78

2211.26

2630.75

3050.23

3469.72

PRED_Yield

Figure 7: Scatter plot of residuals against predicted


showing homoscedasticity of variances.

Another strategy for validating the homoscedasticity assumption for the treatment factor is the Levene test (Montgomery, 1991). Although this test was developed for the completely randomized design (one-way classification), it can be extended to more complex models. The test consists of an analysis of variance using the absolute value of the residuals as the dependent variable. This analysis should be performed with a one-way classification model.
The hypotheses tested are:
H0: σ1² = σ2² = ... = σa² versus H1: at least two variances are different, where σi² is the variance of treatment i, i = 1,...,a.
If the p-value for the treatment factor of this ANOVA is less than the nominal significance level, the hypothesis of homogeneous variances is rejected; otherwise, the assumption of equal variances can be sustained. InfoStat does not implement this test as such in the hypothesis-testing section, but it can easily be constructed, since the absolute values of the residuals can be saved automatically with the option Abs(residuals).
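A minimal sketch of this construction, assuming hypothetical residuals for five treatments with three replicates each (the layout of the Contrast1 example), is shown below; it runs the one-way ANOVA on the absolute residuals and, for comparison, scipy's ready-made Levene test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.normal(0, 1.4, size=3) for _ in range(5)]   # hypothetical residuals by treatment

abs_groups = [np.abs(g) for g in groups]                  # Levene-type test as described:
F, p = stats.f_oneway(*abs_groups)                        # one-way ANOVA on |residuals|
print(f"F = {F:.2f}, p = {p:.4f}")                        # a large p supports equal variances

print(stats.levene(*groups, center='mean'))               # scipy's built-in equivalent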

Independence: to verify the assumption of independent errors, you can make a scatter plot of the residuals against the variable presumed to generate dependence among the observations. A classic example is the time sequence in which the operators made the observations: if the measurement technique and/or the observation can be affected by operator fatigue, the residuals may not be independent of the data-collection sequence. The dependency structure may be related to the way the data were collected. A tendency for positive and/or negative residuals to cluster indicates the presence of correlation or lack of independence. In general, a good randomization process ensures compliance with the independence assumption.
Block-treatment additivity: in addition to the classical ANOVA assumptions for a one-way classification (no plot structure), i.e., independent and identically distributed (i.i.d.) normal errors with zero mean and homogeneous variances, designs with a plot structure assume that the plot structure does not interact with the treatment structure, i.e., that their effects are additive. There should be no interaction between the components of the design structure and the components of the treatment structure, since it is assumed that the relationship between treatments is consistent from block to block (except for random variation). Another way to understand this assumption is to think that the blocks or groups of homogeneous plots have no influence on the differences between treatments. If this were not so, an experimental design that considers the interaction between the blocking factors and the treatments should be used. To explore this assumption it may be useful to plot the values of the variable (or the residuals) on the Y axis and the treatments on the X axis, using connectors between the points coming from the same block (partition by block). The existence of crossings or the lack of parallelism among the plotted profiles suggests a lack of additivity or the presence of block-treatment interaction.
The following figure shows an experimental design with 4 blocks and 4 treatments (genotypes), where it can be seen that the ranking of the treatments is the same in each block (no block-treatment interaction).

Figure 8: Dot plot to analyze the interaction block-treatment.


Below, the chart used to assess the block-treatment additivity assumption for the model fitted with the Block file data is presented.
Figure 9: Dot plot to analyze the interaction block-treatment for Block file.
As can be seen, there are crossings between some profiles; however, they are not considered serious enough to doubt the fulfillment of the assumption. There are formal tests to verify compliance with the block-treatment additivity assumption (Montgomery, 1991).

Analysis of covariance
In addition to the techniques that control the plot structure to improve the precision of the comparisons between treatments, there is another technique that involves the use of covariates, called analysis of covariance. A covariate is a variable observed on each of the experimental units; if it is linearly related to the variable under study, it can be used to correct the response variable before making comparisons between treatments. This technique is a combination of analysis of variance and linear regression. It can be used regardless of the plot and/or treatment structure of the experiment. The covariate requires distributional assumptions similar to those of regression analysis. Additional assumptions for the use of this technique are the absence of covariate-treatment interaction and the equality of regression coefficients among the treatment groups. The model used for this analysis includes, besides the treatment and plot structures, one or more explanatory variables. If the design is completely randomized and there is one covariate to adjust for, the model is:

Yij = μ + τi + β Xij + εij    with i=1,...,a; j=1,...,n

where μ is the general mean, τi the effect of the i-th treatment, β is the unknown parameter representing the rate of change in Y per unit change in X, Xij is the regressor variable or covariate and εij is the random error associated with the experimental unit. Usually the error terms are assumed to be normally distributed with zero expectation and common variance σ².
If the design has a plot structure, it must be declared in the model as usual. If there is more than one covariate, they are added to the model as β1X1ij, β2X2ij, etc.
In the analysis of variance table a new factor (the covariate) is added, with one degree of freedom, on which inferences can be made. The comparisons and/or a posteriori contrasts must be conducted on the means corrected for the effect of the covariate. When one or more covariates are entered in the analysis of variance, InfoStat automatically displays the mean of the dependent variable, for each level of a classification factor, adjusted to the mean values of the covariates. For the analysis of assumptions, InfoStat directly provides the residuals and predicted values adjusted by the model with the covariate(s) included. The residual for a one-way classification design with a single covariate has the form:

eij = (yij - ȳi.) - β̂ (xij - x̄i.)
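For readers who want to reproduce a one-way analysis of covariance outside InfoStat, the following sketch fits the same kind of model with statsmodels on simulated data; the data values, coefficients and column names are hypothetical stand-ins for the Covariance file.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"Species": rng.choice(["chilensis", "flexuosa", "nigra"], size=39),
                   "Neighbors": rng.integers(0, 8, size=39)})
base = df["Species"].map({"chilensis": 14.0, "flexuosa": 14.3, "nigra": 16.8})
df["Increase"] = base - 0.5 * df["Neighbors"] + rng.normal(0, 0.8, size=39)

# One-way ANCOVA: treatment factor plus the covariate entered as a regressor
fit = smf.ols("Increase ~ C(Species) + Neighbors", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))   # ANOVA table for the fitted ANCOVA
print(fit.params["Neighbors"])         # estimated regression coefficient of the covariate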

Example 21: In order to compare the increase in diameter at breast height (DBH) over a period of 5 years for three species of carob trees, an observational study was conducted on a total of 39 trees, selected at random from a stand on which the species nigra, flexuosa and chilensis were represented. Once the trees to be measured were selected, the number of carob trees (regardless of species) with more than 5 cm DBH growing within a radius of 15 meters was counted (neighbors). This variable was used as a covariate in the ANOVA to determine differences in the growth of the three species. The data are in the Covariance file.
The data file contains three columns: one identifying the covariate ("Neighbors"), another the species ("Species") and another the response variable ("Increase"). To analyze, select the menu STATISTICS > ANALYSIS OF VARIANCE. In the variable selector of Analysis of variance, "Species" was declared as a Class variable, "Increase" as the Dependent variable and "Neighbors" as a Covariate; the following Analysis of variance window then indicates that the "Species" and "Neighbors" variables have been selected as classification variables and displays them in the subwindow Specification of model terms. In the Comparisons flap the LSD test was chosen, in Means to show by "Species" was checked, and the significance level was left at 0.05 (default). Pressing Go opens a Results window containing the information presented in Table 36.

Table 36: Analysis of covariance output. File Covariance.
Analysis of variance

Variable   N    R²     Adj R²   CV
Increase   39   0.99   0.99     5.16

Analysis of variance table (Partial SS)

S.V.         SS        df   MS        F         p-value   Coef
Model        1776.61    3    592.20    994.93   <0.0001
Species        61.70    2     30.85     51.83   <0.0001
Neighbors    1570.42    1   1570.42   2638.38   <0.0001   -4.93
Error          20.83   35      0.60
Total        1797.45   38

Test: Fisher LSD   Alpha: 0.05   LSD: 0.63368
Error: 0.5952   df: 35
Species     Means   n    S.E.
Chilensis   14.06   19   0.18   A
Flexuosa    14.26    8   0.28   A
Nigra       16.84   12   0.22      B
Different letters indicate significant difference between location parameters (p<=0.05)

When covariates are introduced, InfoStat adds to the ANOVA table a column with their regression coefficients (Coef). In this example, there is a significant linear relationship (p<0.0001) with a negative slope (-4.93) between growth and the number of neighbors. As the number of neighbors may differ for trees of different species, it is necessary to discount the effect of the number of neighbors on growth before making comparisons between treatments (means adjusted for the covariate). InfoStat performs this operation automatically when the number of neighbors is entered as a covariate in the variable selector for this analysis.

Non-parametric analysis of variance


Kruskal-Wallis test
Menu STATISTICS > NONPARAMETRIC ANOVA > KRUSKAL-WALLIS.
It allows a nonparametric analysis of variance for a one-way classification. The test proposed by Kruskal and Wallis (1952) allows comparison of the expected values of 2 or more distributions without assuming that the error terms are normally distributed.
The null hypothesis states that μ1=μ2=...=μa, where μi represents the expected value of the i-th treatment, i=1, 2,...,a. This test applies when there are independent samples from each population, the observations are continuous and the population variances are equal.
The test statistic (H) is based on the sum of the ranks assigned to the observations within each treatment. Its exact distribution is obtained by considering all possible configurations of the ranks of the N observations in a groups of n observations each.

InfoStat uses the exact distribution of the statistic for cases where the total number of rank configurations does not exceed 100,000. The number of possible configurations of the ranks increases rapidly with the number of treatments and/or the number of replicates per treatment. For three treatments with three replicates each, the number of configurations is 1,680, but if each treatment has 5 replicates the number of configurations is greater than 100,000. For experimental situations where the number of configurations exceeds 100,000, InfoStat obtains the p-values of the test through the approximation of the distribution of the statistic by a chi-square distribution with a-1 degrees of freedom.
To perform the nonparametric ANOVA, in the Kruskal-Wallis window indicate in Variables the variable(s) that will act as dependent variable(s) and in Classification criteria the factors to consider, i.e., the column(s) of the file that define the groups whose means are compared. The user may request the mean, median, standard deviation (SD), mean ranks and sample size (n) for each treatment. You may also request the Kruskal-Wallis statistic (H), the associated p-value (p) and the correction factor that InfoStat uses to adjust the statistic in the case of ties (C).
InfoStat also allows requesting pairwise comparisons between the treatment mean ranks and/or contrasts between treatment mean ranks. The procedure used to judge the significance of the multiple comparisons and of the postulated contrasts is described in Conover (1999). While comparisons between treatments are made through the differences between mean ranks, InfoStat can also display the differences between treatments in terms of the means and medians of the original values of the variables.
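As a point of reference, the Kruskal-Wallis H statistic can also be computed with scipy; the sketch below uses hypothetical data and the chi-square approximation, whereas InfoStat uses the exact distribution whenever the number of rank configurations allows it.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.normal(10 + shift, 2, size=5) for shift in (0, 1, 3)]   # hypothetical samples

H, p = stats.kruskal(*groups)          # H statistic with chi-square approximation
print(f"H = {H:.2f}, p = {p:.4f}")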

Friedman test
Menu STATISTICS > NONPARAMETRIC ANOVA > FRIEDMAN. It allows a nonparametric analysis of variance for a two-way classification. The test proposed by Friedman (1937, 1940) allows comparison of the expected values of 2 or more distributions when the experiment was laid out as a randomized complete block design, without having to verify the normality assumption.
This test requires that the observations be independent and that the population variances be homogeneous. The hypothesis to be tested is H0: μ1=μ2=...=μa, where μi is the expected value of the i-th treatment, i=1, 2,...,a.
Example 22: Suppose that 12 households test the durability of 4 types of light bulbs. Each householder reports the number of hours lasted by each type of bulb. The data are in the Friedman file. The hypotheses tested are H0: there is no difference in durability among the four types, versus H1: at least one type has a different durability.
The design of this experiment can be assimilated to a randomized complete block design. The observations from each householder can be considered more correlated with one another than with observations from other householders, because they come from tests performed in the same household. Each home can therefore be considered as a block. The blocking of the observations was done in order to distinguish the variation due to households (which will probably make different use of the light bulbs) from the random variation or experimental error. Each household evaluates the 4 types or treatments. To perform the Friedman test, the rank transformation is applied to the variable under study within each home or block.
Note: in InfoStat this test requires a file with a columns, one for each treatment, where each "case" corresponds to a block.

In the Friedman test window, you should indicate which columns of the file represent the treatments in Variables that define treatments; in the example these are Bulb 1, Bulb 2, Bulb 3 and Bulb 4. In the next window you may request a posteriori multiple comparisons with the desired significance level. The comparison procedures are based on the rank sums by treatment and the variance of the ranks, as described in Conover (1999).
Table 37: Friedman test output. File Friedman.
Friedman's test
Bulb 1   Bulb 2   Bulb 3   Bulb 4   T      p
3.21     1.96     2.04     2.79     3.28   0.0331

Minimum significant difference between sums of ranks = 11.498

Treatment   Sum of ranks   Mean of ranks   n
Bulb 2      23.50          1.96            12   A
Bulb 3      24.50          2.04            12   A  B
Bulb 4      33.50          2.79            12   A  B  C
Bulb 1      38.50          3.21            12         C
Different letters indicate significant difference between location parameters (p<=0.050)
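A comparable computation can be sketched with scipy, assuming a hypothetical 12-block by 4-treatment layout like that of the Friedman file; note that scipy reports the chi-square form of the statistic, not the T statistic shown in Table 37, so the values will differ.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
hours = rng.normal(1000, 50, size=(12, 4)) + np.array([30.0, -20.0, -15.0, 10.0])

# Each argument is one treatment (bulb type) measured in the same 12 blocks (households)
stat, p = stats.friedmanchisquare(hours[:, 0], hours[:, 1], hours[:, 2], hours[:, 3])
print(f"chi2 = {stat:.2f}, p = {p:.4f}")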

Linear regression analysis


Menu STATISTICS > LINEAR REGRESSION allows studying the functional relationship between a response variable Y (dependent variable) and one or more explanatory variables X (independent or predictor variables). The first case is known as simple linear regression and the second as multiple linear regression (Draper and Smith, 1998).
Regression studies how changes in the predictor variable(s) affect the response variable by setting up a model for the functional relationship between them. Generically, the relationship between the variables is modeled in the form Y = Xβ + ε, where Y is the vector of observations, X is the matrix containing the explanatory variables, β is the vector of parameters to be estimated from the data and ε is the vector of random error terms.
InfoStat uses the method of least squares to obtain estimates of the coefficients of the equation that describes the relationship between the variables. From these coefficients the prediction equation is constructed, which gives the predicted value of Y for any value of the regressor variable(s) within the domain of the observed values. InfoStat also performs weighted least squares regression to handle situations of heterogeneity of variances of the error terms.
Through the analysis of variance you can determine how much of the variation in the data is explained by the regression and how much should be regarded as unexplained or residual. If the explained variation is substantially greater than the unexplained variation, the proposed model will be useful for predictive purposes. A measure of the predictive ability of the model is the coefficient of determination R², which relates the variation explained by the model to the total variation.
When the data contain more than one observation for at least one value of an independent variable, a measure of pure error (not caused by poor specification of the model) can be obtained and the lack of fit of the model can be tested.
To identify a good model and to check compliance with the assumptions of the analysis, InfoStat provides different measures considered as diagnostic criteria.

Model
The equation of the multiple linear regression model is:
Yi = β0 + β1 x1i + β2 x2i + ... + βk xki + εi
where
Yi = i-th observation of the dependent variable Y
x1i, x2i, ..., xki = i-th values of the regressor (independent) variables X1, X2, ..., Xk
β0 = unknown parameter representing the intercept (the expected value of Y when x1=0, x2=0, ..., xk=0)
β1, ..., βk = unknown parameters representing the rates of change in Y per unit change in X1, X2, ..., Xk, respectively
εi = random error term
In general, it is assumed that within the domain studied the relationship between the response and the explanatory variables is well approximated by the proposed linear regression model. The predictor variables are variables selected by the researcher that represent the populations from which the responses were obtained. The random errors are independent and normally distributed with mean zero and constant variance.
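To make the least squares machinery concrete, the following hypothetical sketch builds a design matrix with an intercept and two regressors and obtains the ordinary least squares estimates directly with numpy; it is only an illustration of the computations, not InfoStat's implementation.

import numpy as np

rng = np.random.default_rng(0)
n = 45
x1, x2 = rng.uniform(3, 7, n), rng.uniform(0, 40, n)     # hypothetical regressors
y = 300 + 95 * x1 - 30 * x2 + rng.normal(0, 25, n)       # hypothetical response

X = np.column_stack([np.ones(n), x1, x2])                 # design matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)                 # ordinary least squares estimates
residuals = y - X @ b
print(np.round(b, 2))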

Menu STATISTICS > LINEAR REGRESSION opens this analysis. The variables involved are declared in the Linear regression window: the Y variable is placed as the Dependent variable and the X variables are assigned as Regressors. The Weights (only one) subwindow must be used when you want to weight the observations of the response variable differentially. If you do not specify a column containing the weights, the estimates reported for the model parameters are obtained by ordinary least squares; if weights are given, the reported estimates are obtained by weighted least squares.
To perform the analysis separately by groups, the variable that separates the data into groups is declared in the Partition criteria option of the Partition criteria flap. This flap also contains the Weights option to indicate the variable representing the weights in a weighted least squares analysis.
The dialog window shows the flaps General, Diagnostic, Polynomial, Hypothesis and Model selection.
General flap: allows selecting the information to be displayed in the results. By default the results show the matrix of regression coefficients and the analysis of variance table, but you can add diagnostic criteria and the covariance matrix of the regression coefficients. InfoStat allows the following options to be selected:
Regression coefficients and associated statistics: reports, for each parameter included in the model (Coef), the estimate (Est), the standard error of the estimate (SE), the limits of the 95% confidence interval (LL and UL), the value of the T statistic for testing the hypothesis that the parameter is zero, the p significance value for that test, and Mallows' Cp index.
Mallows' Cp: for each term in the model InfoStat calculates the Cp index as follows:

Cp = SSError_p / MSError - (n - 2p)

where SSError_p is the error sum of squares of a reduced model (with p parameters, including the constant) relative to the full model specified by the user; the reduced model contains all the terms of the full model except the term in the row where Cp is reported. MSError is the mean square error of the full model specified by the user and n is the total number of observations. Thus, for each regressor there is an indicator of its contribution to the fit of the model proposed by the user, since Cp values close to p correspond to models with small bias in prediction. If removing a regressor increases the Cp value markedly, that regressor can be considered important for the model fit.
In multiple regression, the elimination of one or more explanatory variables can increase the predictive value of the model even though R² decreases. Mallows' Cp for the constant term is not reported by InfoStat.
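The following hypothetical sketch applies this definition directly: it fits the full model, refits a reduced model without one regressor, and computes Cp = SSError_p/MSError - (n - 2p) for the reduced model.

import numpy as np

def sse(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

rng = np.random.default_rng(0)
n = 45
x1, x2 = rng.uniform(3, 7, n), rng.uniform(0, 40, n)
y = 300 + 95 * x1 - 30 * x2 + rng.normal(0, 25, n)

X_full = np.column_stack([np.ones(n), x1, x2])
mse_full = sse(X_full, y) / (n - X_full.shape[1])   # MSError of the full model

X_red = X_full[:, :2]                               # reduced model: drop x2
p = X_red.shape[1]                                  # parameters kept (incl. constant)
cp = sse(X_red, y) / mse_full - (n - 2 * p)
print(round(cp, 2))     # a value far above p flags x2 as important for the fit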


Analysis of variance table: shows the R² coefficient, the adjusted R², the mean square error of prediction and the analysis of variance for the specified model. This table includes the lack-of-fit test (reported as Lack of fit) when pure error is requested in the Options subwindow.
The R² coefficient measures the proportion of the variation in Y that is explained by the relationship with X. R² is calculated by dividing the sum of squares of the model by the total sum of squares. The adjusted R² is obtained from the expression:

Adj R² = 1 - (1 - R²) [(n - 1)/(n - p)]

where n is the total number of observations and p the number of parameters of the fitted model.
In the analysis of variance, the sums of squares are type III by default, but the use of type I sums of squares can be indicated. Type I sums of squares are called sequential sums of squares because they partition the model sum of squares according to the sequence in which the terms are incorporated into the model. That is, assuming the specified model is Y = β0 + β1X1 + β2X2, the sum of squares for X1 is the sum of squares of the model containing the constant and X1, and the sum of squares for X2 is the reduction in the error sum of squares obtained when X2 is added to that model. Because type I sums of squares depend on the order in which the terms are added to the model, different orderings will produce different sums of squares. InfoStat calculates type I sums of squares by incorporating the terms into the model in the order in which the variables were assigned in the variable selector. Type I sums of squares are especially recommended for polynomial regression models.
Summary table of diagnostic criteria: shows the extreme values, maximum (max) and minimum (min), of the studentized residuals (sr), externally studentized residuals (esr), leverage (Lev) and Cook's distance (Cook), identifying the cases associated with each of these extreme values.
Covariance matrix: displays the covariance matrix of the estimates of the regression coefficients.
Options subwindow: allows including the intercept in the model; working with Centered regressors (prior to the regression analysis the regressors are centered by their means, so the constant term corresponds to the mean response under average conditions of the regressors); requesting the Atkinson test; obtaining the calculation of pure error to test the lack of fit of the proposed model; and obtaining all the simple regressions.
The Atkinson test helps determine whether it is necessary to use the power transformation:

Y* = Y^λ

InfoStat estimates a parameter φ that is related to the power transformation by λ = 1 - φ. The Atkinson test contrasts the hypothesis φ = 0. Not rejecting this hypothesis implies that λ = 1, and therefore the power transformation of the data is not needed. If the test is significant, the power transformation is advisable and the exponent of the transformation is given by 1 - φ̂. The estimate φ̂ is reported in the table of regression coefficients as the estimate (Est) of the Atkinson coefficient. This test cannot be performed if the variable to be transformed has zero values; in that case the variable can be transformed by adding a constant in order to obtain the test results.
The lack-of-fit test allows confirming whether the model used fits the data. It requires an estimate of σ² that is independent of the model, called "pure error". To estimate it, more than one observation is needed for at least one point in the domain of the regressor variable. The results are read in the analysis of variance table.
Diagnostics flap: allows selecting diagnostic elements, requesting the calculation of predicted values and of confidence and prediction intervals, and choosing a confidence level. The requested information is saved as new variables in the original file. The calculation of predicted values and of confidence and prediction intervals is also made for those values of X in the data table that do not have a corresponding value of Y. Likewise, if you want to know the predicted value for an x value that is not in the data set, that value must be entered in the table and the analysis run again, requesting the calculations of interest.
The predicted values are the values of the dependent variable obtained using the fitted model. The fitted model is built with the parameter estimates.
The confidence intervals are intervals for the expected value of Y given X = x0, which in the simple regression case are given by the following expression:

ŷ0 ± t(1-α/2) √{ S² [ 1/n + (x0 - x̄)² / ( Σxi² - (Σxi)²/n ) ] }

where t(1-α/2) is the corresponding quantile of Student's t distribution with n-2 degrees of freedom and S² is the estimate of σ². When the confidence intervals obtained for all the values of X in a given range are joined, the confidence bands are obtained.
Prediction intervals are intervals for a value of Y given X = x0, which in the simple regression case have the following expression:

ŷ0 ± t(1-α/2) √{ S² [ 1 + 1/n + (x0 - x̄)² / ( Σxi² - (Σxi)²/n ) ] }

In the case of multiple regression the expression for this interval is:

ŷ0 ± t(1-α/2) √{ S² [ 1 + 1/n + (x0 - x̄)' (X'X)⁻¹ (x0 - x̄) ] }

where ŷ0 is the predicted value, x0 is the vector of values for which the prediction is desired, x̄ is the mean vector of the explanatory variables and X is the design matrix. The quantile t(1-α/2) corresponds to a Student's t distribution with n-p degrees of freedom, where p is the number of parameters (regression coefficients).
When the prediction intervals of Y obtained for the values observed in the sample are joined, upper limits with upper limits and lower limits with lower limits, the prediction bands are obtained. The difference between a confidence and a prediction interval is that the former defines a region that with probability 1-α contains the expected value of Y given the regressor, while the limits of a prediction interval are those within which a future realization of Y given the regressor is expected to fall with probability 1-α.
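The simple regression formulas above can be applied directly, as in this hypothetical sketch that computes the confidence and prediction intervals at x0 = 3.5 for simulated biomass-versus-pH data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(3, 7, 45)
y = 314 + 95.6 * x + rng.normal(0, 29, 45)       # hypothetical biomass-vs-pH data
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)  # estimate of sigma^2

x0 = 3.5
y0 = b0 + b1 * x0
t = stats.t.ppf(0.975, n - 2)
se_mean = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / sxx))   # for the mean of Y at x0
se_pred = np.sqrt(s2 + se_mean ** 2)                           # for a future observation
print("confidence:", y0 - t * se_mean, y0 + t * se_mean)
print("prediction:", y0 - t * se_pred, y0 + t * se_pred)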
In the Diagnostics flap you can activate the graphic fields Fit, Confidence intervals and Prediction intervals to automatically obtain a graph with these components.
As diagnostic measures, InfoStat allows obtaining the residuals (RDUO_"variable name"), studentized residuals (RE_"variable name"), externally studentized residuals (REE_"variable name"), Cook's distances (COOK_"variable name"), leverages (LEVE_"variable name") and partial residuals (RPAR_"variable name"), which are added to the data table (under the names shown in parentheses) and calculated as follows:
Studentized residual: the ratio of the residual associated with observation i to the square root of the product of the mean square error (S²) and the term (1 - hii), where hii is the leverage:

RE_i = e_i / √( S² (1 - hii) )

Externally studentized residual: the ratio of the residual associated with observation i to the square root of the product of the mean square error computed after removing observation i (S(i)²) and the term (1 - hii), where hii is the leverage:

REE_i = e_i / √( S(i)² (1 - hii) )

Leverage: a measure of the contribution of the i-th observation to the i-th fitted value. The leverages are the diagonal elements (hii) of the matrix H, with H = X(X'X)⁻¹X', where X is the design matrix formed by the explanatory variables. The product HY yields the fitted values, with Y the vector of observations.
Cook's distance: measures the influence of the i-th observation. Large values of this measure indicate observations whose removal has a great influence on the predicted values. The expression for its calculation is:

Cook_i = (1/p) [ e_i / √( S² (1 - hii) ) ]² [ hii / (1 - hii) ]

where p is the number of model parameters.
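These diagnostic quantities can be computed directly from the hat matrix, as in the following hypothetical sketch for a simple regression; the same expressions apply in multiple regression with the corresponding design matrix.

import numpy as np

rng = np.random.default_rng(0)
n = 45
x = rng.uniform(3, 7, n)
y = 314 + 95.6 * x + rng.normal(0, 29, n)       # hypothetical data

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix H = X(X'X)^-1 X'
h = np.diag(H)                                  # leverages h_ii
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
p = X.shape[1]
s2 = e @ e / (n - p)                            # mean square error

re = e / np.sqrt(s2 * (1 - h))                  # studentized residuals
cook = (re ** 2 / p) * (h / (1 - h))            # Cook's distance
print(np.round(cook[:5], 4))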


Partial residuals: in a multiple regression they are obtained as the residuals of the full model plus the product of a particular regressor and its regression coefficient. For example, in a multiple regression involving two regressors X1 and X2, the partial residual associated with X2 for the i-th observation is obtained as follows: 1) fit the model Yi = β0 + β1X1i + β2X2i + εi and obtain the residual ei; 2) calculate RPi = ei + β̂2X2i.
The Typical button activates as diagnostic elements those statistics typically selected at the stage of diagnosis and evaluation of model fit: studentized residuals, predicted values and leverage.
If diagnostic elements are stored, they are added as new columns to the active table, so if the analysis is performed several times, as many new columns as requested elements will be generated (numbered consecutively). To avoid this proliferation of columns you can activate the Overwrite field.
The analysis of the diagnostic statistics can be supplemented with graphics. A chart commonly used in linear regression is the scatter plot. Usually more than one plot is made: Y versus X (or versus each X in the case of multiple regression), residuals versus predicted values, residuals versus X, and leverage and/or partial residuals versus each of the regressors. Through these graphics you can detect outliers, influential points, violations of the assumptions and the type of relationship between the variables, in order to improve the proposed model. To obtain these graphs, the corresponding Save fields must be activated in the Diagnostics flap. To construct other graphs, see the Graphics chapter.
Polynomial flap: allows a polynomial fit of degree n, including in the model the quadratic, cubic and higher-order terms indicated by the user for the selected predictor variables.
Hypothesis flap: allows specifying hypotheses about one or more coefficients of the regression model in the general form Hb = h. The vector b is the vector of regression coefficients and H is a matrix of contrasts. The rows of H contain the coefficients (specified by the user) of a linear combination of the elements of b, and h is the vector containing the hypothesized values of Hb, also specified by the user. If values for h are not entered, the system assumes they are zero.
Model selection flap: allows backward elimination, indicating the maximum p-value for retaining a component in the model.
Example 23: To study the relationship between biomass and pH in a culture medium, biomass (g) was measured at pH values between 3 and 7; 45 measurements were recorded. The data are in the LinearReg file.
Biomass was used as the dependent variable and pH as the explanatory variable. The following graph, obtained by default, shows the behavior of the variables.


Figure 10: Scatter plot of Biomass versus pH. File LinearReg.


The diagram would indicate that there is a positive relationship between biomass and
pH. The regression analysis reported the following results:
Table 38: Linear regression analysis output. File LinearReg.
Linear Regression

Variable   N    R²     Adj R²   PMSE     AIC      BIC
Biomass    45   0.95   0.95     909.73   434.87   440.29

Regression coefficients

Coef    Est.     S.E.    LL(95%)   UL(95%)   T       p-value   Mallows Cp
const   313.95   15.87   281.94    345.96    19.78   <0.0001
pH       95.56    3.35    88.80    102.32    28.51   <0.0001   795.36

Analysis of variance table (Partial SS)

S.V.    SS          df   MS          F        p-value
Model   685876.59    1   685876.59   812.81   <0.0001
pH      685876.59    1   685876.59   812.81   <0.0001
Error    36284.63   43      843.83
Total   722161.23   44

As you can see in the analysis of variance table, there is a significant linear relationship between biomass and pH (p<0.0001). It is also noted that the proposed model shows no lack of fit (p=0.4348). Using the information about the regression coefficients you can write the equation of the fitted model:

ŷ = a + bx = 313.95 + 95.56 x

This line allows estimating the value of y (predicted value) for a given value of x. The fitted model can be used for predictive purposes; e.g., for a pH of 3.5 the expected biomass is ŷ = 313.95 + 95.56 (3.5) = 648.41 g. This result, like any other desired prediction using X values inside or outside the range studied, can be obtained automatically by entering the value 3.5 as data in the pH column and running the analysis again, requesting the calculation of predicted values. The following graph was obtained by requesting, in the Diagnostics flap, the confidence and prediction bands:
Figure 11: Scatter plot of Biomass versus pH with confidence and prediction bands. File
LinearReg.
In the figure above, the center line corresponds to the fitted model, the inner lines correspond to the confidence bands and the outer lines to the prediction bands.
To view some diagnostic elements, plots of the studentized residuals versus the predicted values, the leverage and Cook's distance were made. InfoStat generated the following:

Figure 12: Scatter plots of residuals versus predicted values and leverage versus case. File LinearReg.

Figure 13: Scatter plot of the Cook statistic versus case. File LinearReg.

Validation of assumptions
Normality

Figure 14: Q-Q plot. File LinearReg.


Note that in the QQ plot was carried out with the residuals of the regression model using as
a theoretical distribution the Normal (see QQ-plot). The points are arranged in a line at 45
indicating that the distributional assumption for residuals is true. Performing the ShapiroWilks test (modified) TWO-SMAPLE INFERENCE menu is concluded that the data follow
a normal distribution (p=0.8295). The results of this test are presented in the following table.
Table 39: Shapiro-Wilks normality test for residuals. File LinearReg.
Shapiro-Wilks (modified)

Variable      n    Mean   S.D.    W*     p (one tail)
RES_Biomass   45   0.00   27.76   0.98   0.9082

Homoscedasticity

Figure 15 shows that the points for higher pH values have less scatter than the rest, which is why a formal test of homogeneity of variances would be advisable.
Figure 15: Scatter plot of residuals versus predicted values. File LinearReg.
Below is an example of the application of the multiple linear regression technique.
Example 24: To study the relationship between pH (pH), salinity (Salinity) and the Zn (Zinc) and K (Potassium) contents of the soil and the biomass production of a forage crop, 45 measurements were made of biomass (g) and of these characteristics of the soil where the plants grew. The data are in the Salinity file.
In the variable selector, "Biomass" was identified as the Dependent variable, and "pH", "Salinity", "Zinc" and "Potassium" as Regressors.


Then, in the Linear regression window, the information to be displayed in the results window was indicated. In this example the ANOVA table associated with the linear regression model was selected, together with the table containing the estimates, standard errors and other statistics related to each coefficient of the proposed regression model (defined from the selected explanatory variables). A summary table of diagnostic criteria for assessing model fit could also be requested. In the Options subwindow you may indicate whether or not the model should include a term corresponding to the Intercept, whether you want to work with the regressors Centered by their means, whether the analysis of variance table should contain the test proposed by Atkinson, whether you want an estimate of Pure error, and whether the output should contain All simple regressions that can be fitted from the selected set of regressors.
In the Diagnostic flap, you can specify the diagnostic output to be displayed. Pressing the Typical button stores in the active data table the studentized residuals, the Predicted values from the fitted model and the Leverage of each observation. The Plot partial residuals field automatically produces partial residual plots for each regressor of the model. This field should be activated as a first step in a multiple linear regression analysis, since it provides a preliminary idea of the adequacy of the model. Figure 16 presents these graphics, in which we can see that: 1) there is a positive linear relationship between biomass and soil pH; 2) there is a negative relationship between biomass and soil salt content, but the chart also suggests the presence of a quadratic component in this relationship; 3) there is a negative relationship between biomass and soil zinc content; 4) apparently there is no linear relationship between biomass and soil potassium.
Table 40 shows the result obtained by fitting the regression model with all the explanatory variables.

Figure 16: Partial residual plots. File Salinity.


Table 40: Multiple linear regression analysis output. File Salinity.
Linear Regression

Variable   N    R²     Adj R²   PMSE       AIC      BIC
Biomass    45   0.92   0.92     33301.86   590.55   601.39

Regression coefficients

Coef        Est.      S.E.     LL(95%)   UL(95%)   T       p-value   Mallows Cp
const       1492.81   453.60    576.05   2409.57    3.29   0.0021
pH           262.88    33.73    194.71    331.05    7.79   <0.0001   63.28
Salinity     -33.50     8.65    -50.99    -16.01   -3.87   0.0004    18.65
Zinc         -28.97     5.66    -40.42    -17.52   -5.11   <0.0001   29.55
Potassium     -0.12     0.08     -0.28      0.05   -1.40   0.1680     5.95

Analysis of variance table (Partial SS)

S.V.        SS            df   MS           F        p-value
Model       12120944.19    4   3030236.05   120.01   <0.0001
pH           1533665.03    1   1533665.03    60.74   <0.0001
Salinity      378485.90    1    378485.90    14.99   0.0004
Zinc          660588.37    1    660588.37    26.16   <0.0001
Potassium      49785.48    1     49785.48     1.97   0.1680
Error        1009974.02   40     25249.35
Total       13130918.21   44

You can see that pH, Salinity and Zinc showed p-values < 0.05, i.e., they have a significant linear relationship with biomass. Potassium presented a p-value of 0.1680, i.e., its linear relationship with biomass was not significant.
In multiple regression analysis, explanatory variables should not be deleted from the model without first checking the adequacy of the model. For this reason, before deleting Potassium as a regressor, a new model was fitted incorporating a quadratic component for the Salinity regressor (as suggested by the partial residual plot of biomass versus salinity), following the same steps used earlier to fit the model and adding, in the Polynomial flap, a polynomial of degree 2 for Salinity. The results obtained with this new model are presented in Table 41. When one or more polynomial terms are incorporated for one or more explanatory variables, InfoStat automatically displays in the output window the regression ANOVA table with type I (sequential) sums of squares as well as the table with type III sums of squares.
Table 41: Multiple linear regression analysis output with polynomial terms. File Salinity.
Linear Regression

Variable   N    R²     Adj R²   PMSE       AIC      BIC
Biomass    45   0.92   0.92     33301.86   590.55   601.39

Regression coefficients

Coef        Est.      S.E.     LL(95%)   UL(95%)   T       p-value   Mallows Cp
const       1492.81   453.60    576.05   2409.57    3.29   0.0021
pH           262.88    33.73    194.71    331.05    7.79   <0.0001   63.28
Salinity     -33.50     8.65    -50.99    -16.01   -3.87   0.0004    18.65
Zinc         -28.97     5.66    -40.42    -17.52   -5.11   <0.0001   29.55
Potassium     -0.12     0.08     -0.28      0.05   -1.40   0.1680     5.95

Analysis of variance table (Partial SS)

S.V.        SS            df   MS           F        p-value
Model       12120944.19    4   3030236.05   120.01   <0.0001
pH           1533665.03    1   1533665.03    60.74   <0.0001
Salinity      378485.90    1    378485.90    14.99   0.0004
Zinc          660588.37    1    660588.37    26.16   <0.0001
Potassium      49785.48    1     49785.48     1.97   0.1680
Error        1009974.02   40     25249.35
Total       13130918.21   44


It can now be seen that, besides pH, Salinity and Zinc, the Potassium regressor presents a p-value < 0.05, i.e., a significant linear relationship. Furthermore, with the addition of the quadratic term for salinity, the Mallows Cp for Potassium increased from 5.95 (Table 40) to 13.94 (Table 41), suggesting that this regressor has an important predictive contribution in the model that incorporates the quadratic term for salinity. To verify the adequacy of the final model, plots of the studentized residuals (SR) versus each of the regressors were made (Figure 17); no trends suggesting lack of fit can be seen.

Figure 17: Partial residual plots for the polynomial model. File Salinity.

Regression with dummy variables


Usually in regression the independent variables are quantitative, but there may be a need to include classification variables in the model. Classification variables are often incorporated when the data set being analyzed contains information on the relationship of interest for subgroups identified by one or more classification variables. One option, when there is more than one group of data and we want to analyze the functional relationship between two or more variables, is to fit as many regression models as there are groups. But this is not the most efficient approach, because no single fit uses all the available information. There are more degrees of freedom for the error term if a single model is fitted using all the data, but such a model should account for the presence of groups to prevent the group effect from interfering with the estimation of the functional relationship between the dependent variable and the regressor. This can be done using auxiliary (dummy) variables. To create the auxiliary variables, you can use InfoStat's dummy variable generator: on the DATA menu, choose the Create dummy variables submenu. For more information see Create dummy variables in the Data Management chapter.
Example 25: The Polymer file presents the turbidity of the medium (Y) and the pH for three types of polymers, A, B and C. The focus is on the dependence of the turbidity of the medium on pH. Proposing a simple linear regression model to explain turbidity as a function of pH, using all the data and without indicating that they belong to 3 polymers (without indicating the presence of groups), yields studentized residuals that, plotted as a function of pH and using the polymer variable as a partition criterion, are displayed as follows:
Figure 18: Scatter plot of residuals versus pH. File Polymer.
Placing a reference line at 0 for the residuals, it is clear that this is not a desirable pattern for a residuals-versus-predicted plot. There is no random scatter; instead, the behavior of the residuals is associated with the type of polymer. It is then desirable to propose a regression model that takes into account that the observations are grouped or classified according to the different polymers.
The effect of polymer (a classification variable with three levels) is incorporated into the model through the use of auxiliary variables, postulating the model:

Y = β0 + β1 pH + β2 D1 + β3 D2 + β4 D1*pH + β5 D2*pH + ε

where D1 and D2 represent two auxiliary variables and pH is the quantitative regressor variable. The number of auxiliary variables Di to include is equal to the number of levels of the classification factor to be modeled minus 1. Each auxiliary variable is a dummy variable that takes the value 1 for only one level of the classification factor (in this example, each auxiliary variable must be 1 for only one type of polymer). The set of auxiliary variables associated with a factor allows classifying the observations according to the levels of the factor. Note that D1=1 and D2=0 represent the data associated with polymer A, D1=0 and D2=1 represent the data of polymer B, and D1=0 and D2=0 the data of polymer C. Polymer C is taken as the reference level because all the auxiliary variables included assume the value 0 for this polymer (see Create dummy variables in the Data Management chapter).
The inclusion of dummy variables for the type of polymer allows estimating the difference in average turbidity among the polymers, but it is not enough to establish whether the relationship between pH and turbidity differs among polymers. This possible difference in slopes can be studied by including in the model terms that involve products between the auxiliary variables and the regressor. The coefficients of the variables generated by the product between a regressor and an auxiliary variable then provide evidence about the homogeneity of slopes. For example, if the auxiliary variable D1 equals 1 for polymer A, the test to determine whether the slope of the line for polymer A is equal to or different from that of the reference polymer is obtained from D1*pH. To fit the regression model that includes the auxiliary variables and their interactions with the regressor, the data file must contain the auxiliary variables (in this example, D1 and D2) and their products with the regressor (in this example, D1*pH and D2*pH). The products between the auxiliary variables and the regressor are obtained when creating the dummy variables (menu DATA > CREATE DUMMY VARIABLES) if the regressor variable is included in the Multiply by... panel, as shown in the following table:
Table 42: Data set with the inclusion of dummy variables. File Polymer.

Y     pH    Polymer   Polymer_A   Polymer_B   Polymer_A_pH   Polymer_B_pH
292   6.5   A         1           0           6.5            0
329   6.9   A         1           0           6.9            0
352   7.8   A         1           0           7.8            0
378   8.4   A         1           0           8.4            0
392   8.8   A         1           0           8.8            0
410   9.2   A         1           0           9.2            0
198   6.7   B         0           1           0              6.7
227   6.9   B         0           1           0              6.9
277   7.5   B         0           1           0              7.5
297   7.9   B         0           1           0              7.9
364   8.7   B         0           1           0              8.7
375   9.2   B         0           1           0              9.2
167   6.5   C         0           0           0              0
225   7.0   C         0           0           0              0
247   7.2   C         0           0           0              0
268   7.6   C         0           0           0              0
288   8.7   C         0           0           0              0
342   9.2   C         0           0           0              0
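Outside InfoStat, the same dummy and interaction columns could be built, for instance, with pandas; the sketch below uses only a hypothetical three-row fragment of the Polymer data to show the construction.

import pandas as pd

df = pd.DataFrame({"Y": [292, 198, 167],
                   "pH": [6.5, 6.7, 6.5],
                   "Polymer": ["A", "B", "C"]})   # hypothetical fragment of the file

# Dummy columns for A and B; polymer C is left as the reference level
dummies = pd.get_dummies(df["Polymer"], prefix="Polymer", dtype=float)[["Polymer_A", "Polymer_B"]]
df = pd.concat([df, dummies], axis=1)

# Products of each dummy with the regressor, used to test homogeneity of slopes
df["Polymer_A_pH"] = df["Polymer_A"] * df["pH"]
df["Polymer_B_pH"] = df["Polymer_B"] * df["pH"]
print(df)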

For the regression, select menu STATISTICS > LINEAR REGRESSION. In the Linear regression window, declare Y as the Dependent variable and pH, Polymer_B, Polymer_C, Polymer_B_pH and Polymer_C_pH as Regressors. The results window displays the following output:

Table 43: Linear regression with dummy variables. File Polymer.
Linear Regression

Variable   N    R²     Adj R²   PMSE     AIC      BIC
Y          18   0.97   0.96     556.03   154.26   160.50

Regression coefficients

Coef           Est.      S.E.    LL(95%)   UL(95%)   T       p-value   Mallows Cp
const           39.42    48.77    -66.85    145.68    0.81   0.4347
pH              40.26     6.10     26.97     53.56    6.60   <0.0001   45.27
Polymer_B     -306.43    71.23   -461.62   -151.24   -4.30   0.0010    22.16
Polymer_C     -197.69    68.79   -347.58    -47.80   -2.87   0.0140    12.70
Polymer_B_pH    30.95     8.99     11.38     50.53    3.44   0.0049    16.03
Polymer_C_pH    13.56     8.74     -5.48     32.60    1.55   0.1466     7.30

Analysis of variance table (Partial SS)

S.V.           SS         df   MS         F       p-value
Model          82707.78    5   16541.56   77.76   <0.0001
pH              9261.73    1    9261.73   43.54   <0.0001
Polymer_B       3937.37    1    3937.37   18.51   0.0010
Polymer_C       1756.64    1    1756.64    8.26   0.0140
Polymer_B_pH    2524.22    1    2524.22   11.87   0.0049
Polymer_C_pH     512.47    1     512.47    2.41   0.1466
Error           2552.67   12     212.72
Total          85260.44   17

The table with the regression coefficients provides the information necessary to: 1) build the fitted equation for each polymer and 2) test for equal average effects between groups and for homogeneity of slopes.
Since each polymer is represented by a combination of the auxiliary variables, to obtain the fitted equations you must bear in mind the values of D1 and D2 corresponding to each polymer. Writing the full fitted model in the parameterization in which D1 is the dummy for polymer A and D2 the dummy for polymer B (polymer C as the reference level; this is an equivalent re-expression of the coefficients shown in Table 43), the equation for polymer A is obtained by replacing D1 by 1 and D1*pH by pH (since D1 = 1 for these data) and excluding all the terms involving D2 (since D2 = 0 for polymer A). Rearranging the constant terms and the terms in pH finally gives the equation for polymer A, as shown below:

ŷ = -158.27 + 53.82 pH + 197.69 D1 - 13.56 D1*pH - 108.74 D2 + 17.39 D2*pH
ŷ = -158.27 + 53.82 pH + 197.69 (1) - 13.56 pH
ŷ = 39.42 + 40.26 pH

In a similar way you can obtain the equations for the other polymers:

polymer B:   ŷ = -267.01 + 71.21 pH

polymer C:   ŷ = -158.27 + 53.82 pH


These equations are the same as those obtained by fitting separate regressions using the polymer as a partition criterion, but the standard errors of the estimates are appreciably smaller in the model that includes the auxiliary variables, because it works with more degrees of freedom for estimating the experimental error than the separate fits (one for each polymer).
In the analysis of variance table, the p-values for the interactions Polymer_B_pH and Polymer_C_pH indicate the results of the comparisons of the slopes of polymers A and C and of polymers B and C, respectively. For a significance level of 10%, polymers B and C have different slopes, while there is no difference between the slopes of A and C. To find out whether there are differences between the average turbidity under each condition, the p-values for Polymer_A and Polymer_B must be examined. In the first case, p<0.01 indicates that there are differences in mean turbidity between polymers A and C. The opposite was observed when comparing B and C. The linearity of the relationship is highly significant, since the p-value for the variable pH is less than 0.01.
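For readers who want to reproduce the idea outside InfoStat, the following sketch fits the full model with the dummy variables and their products by ordinary least squares and then collects terms to obtain one intercept and slope per polymer. It uses the data of Table 42 and NumPy only; it is an illustration of the technique, not the InfoStat implementation, and the individual coefficients depend on which polymer acts as the reference level.

# Hedged sketch: regression with dummy variables and dummy-by-regressor products,
# recovering one fitted line per group. Data taken from Table 42 (Polymer example).
import numpy as np

ph = np.array([6.5, 6.9, 7.8, 8.4, 8.8, 9.2, 6.7, 6.9, 7.5, 7.9, 8.7, 9.2,
               6.5, 7.0, 7.2, 7.6, 8.7, 9.2])
y = np.array([292, 329, 352, 378, 392, 410, 198, 227, 277, 297, 364, 375,
              167, 225, 247, 268, 288, 342], dtype=float)
group = np.array(list("AAAAAABBBBBBCCCCCC"))

d_b = (group == "B").astype(float)   # dummy for polymer B (A acts as the reference here)
d_c = (group == "C").astype(float)   # dummy for polymer C
X = np.column_stack([np.ones_like(ph), ph, d_b, d_c, d_b * ph, d_c * ph])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, bB, bC, bBph, bCph = beta

# Intercept and slope for each polymer, obtained by collecting terms.
print(f"A: y = {b0:.2f} + {b1:.2f} pH")
print(f"B: y = {b0 + bB:.2f} + {b1 + bBph:.2f} pH")
print(f"C: y = {b0 + bC:.2f} + {b1 + bCph:.2f} pH")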
When there is interest in comparisons that are not provided by the analysis of variance of the full model, contrasts can be used. For example, the comparison of the slopes of polymers A and B can be done using the coefficients 1 and -1 for the terms Polymer_B_pH and Polymer_A_pH, and 0 for the remaining terms of the model. These coefficients are entered in the H matrix of the Hypothesis tab of the Linear regression window. InfoStat automatically displays the terms of the model so that, in the H Matrix subwindow, the user can specify the contrast coefficients that test the hypothesis of interest.
In the h Vector subwindow you must enter the value assumed for the contrast. If you use the value zero, you will be testing the hypothesis of equality (zero difference) between the parameters of the model that receive the coefficients 1 and -1. Another value may be entered in this subwindow (h Vector) if one assumes that the difference between the selected parameters is of a given magnitude (distance from zero). More than one contrast line can be written in the H Matrix subwindow; for each row of the matrix, InfoStat reports an analysis of variance table containing the sum of squares for the contrast and the corresponding p-value for the hypothesis that the contrast tests. If you want to assign a name to a contrast, enter in the H Matrix window the name followed by ":" and then the contrast coefficients.


The contrast specified in the Hypothesis tab tests whether there are differences between the slopes of polymers A and B. In this example, the resulting difference is significant (p=0.0049).
Table 44: Matrix H and vector h of coefficients for the general hypothesis Hβ=h.

H matrix and h vector for the hypothesis Hβ=h
S.V.                   const   pH   Polymer_B   Polymer_C   Polymer_B_pH   Polymer_C_pH    h
Linear Comb. Coef. 1     0      0       0           0            1             -1          0

Test for the hypothesis
S.V.            SS       df      MS        F      p-value
Hypothesis   2524.22      1   2524.22    11.87    0.0049
The fitted line for each polymer is shown below:

Figure 19: Scatter plot with fitted lines for each polymer. File Polymer.

Nonlinear regression analysis


The nonlinear regression analysis implemented in InfoStat allows obtaining the least squares estimates of the parameters of an arbitrary nonlinear model specified by the user. As in the case of linear regression, the first step is to select the dependent variable and the regressors and, if necessary, to include a partition criterion. Once the variables have been indicated, a dialog box appears in which the system requires the user to enter the model that relates the dependent variable with the regressor(s). For example, if Y is the dependent variable and the regressors are x and t, a model might be:

$y = \alpha\, x\, e^{-\lambda t}$
and the program expects the expression alfa*x*exp(-lambda*t) to be typed. Once the model has been written, its syntax must be validated. This can be done by pressing <Enter> at the end of the proposed expression or by pressing the verification button. The verification process establishes whether the model is free of syntax errors. If so, it identifies the parameters to be estimated and assigns them default initial values. The identified parameters and their initial values appear at the bottom right of the dialog box. The initial values can be modified to propose a set of starting values that facilitates the convergence of the estimation algorithm. To edit the values simply double-click on the value to be modified. This brings up an edit field in which you can specify the initial value for each parameter; accept with <Enter> or double-click on another initial value to modify it.
The choice of initial parameter values is a critical issue in nonlinear regression and should be given special attention. Some models do not converge when starting from initial values far from those that minimize the sum of squares, or the method may converge to a local minimum away from the global optimum.
Once the model and its initial values have been accepted, the next step is the estimation. Due to the difficulty imposed by nonlinear estimation methods, InfoStat addresses the problem in two phases. The first phase searches for an approximate solution using the downhill simplex method proposed by Nelder and Mead (1965), which does not require the evaluation of partial derivatives (minimizing the possibility of numerical errors). This phase ends when a solution is found or when the preset maximum number of iterations is reached (500, not modifiable by the user). The second phase applies the Levenberg-Marquardt method (Press et al., 1986), starting from the previous solution. This method requires the calculation of the Hessian matrix, which is needed to compute the covariance matrix of the estimates. This phase ends when the difference in the sum of squares between two successive iterations is less than or equal to 1E-10 or when the maximum number of iterations specified by the user (default 20) is reached.
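A minimal sketch of the same two-phase strategy, written with SciPy rather than InfoStat's own routines, is shown below; the model, data and starting values are invented for illustration.

# Hedged sketch: derivative-free simplex search followed by Levenberg-Marquardt
# refinement, mimicking the two-phase strategy described above.
import numpy as np
from scipy.optimize import minimize, least_squares

def model(params, x, t):
    alfa, lam = params
    return alfa * x * np.exp(-lam * t)

rng = np.random.default_rng(0)
x = rng.uniform(1, 5, 40)
t = rng.uniform(0, 3, 40)
y = model([2.0, 0.7], x, t) + rng.normal(0, 0.05, 40)   # synthetic observations

def sse(p):                                              # sum of squares to minimize
    return np.sum((y - model(p, x, t)) ** 2)

start = [1.0, 1.0]                                       # user-supplied initial values

# Phase 1: downhill simplex (Nelder-Mead), no derivatives required.
phase1 = minimize(sse, start, method="Nelder-Mead", options={"maxiter": 500})

# Phase 2: Levenberg-Marquardt on the residuals, starting from the simplex solution.
phase2 = least_squares(lambda p: y - model(p, x, t), phase1.x, method="lm")
print(phase2.x)                                          # refined estimates of alfa and lambda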
If InfoStat finds a solution, the results window presents a table including the parameter estimates, their (asymptotic) standard errors and a T test for the hypothesis of nullity of each parameter. InfoStat also provides a summary of the amount of data included in the analysis, an estimate of the error variance and the number of iterations in which the Levenberg-Marquardt method reached the solution. If the number of iterations equals the specified maximum (20 by default), the result shown may not be correct, since the algorithm did not converge. In that case it is advisable to repeat the analysis with a larger number of iterations and/or a different set of initial values for the parameters.

Default Models
When InfoStat recognizes that only one regressor is available, it offers a set of nonlinear functions commonly used in modeling. These functions are: Logistic, Shifted Logistic, Gompertz, Shifted Gompertz, Exponential, Monomolecular and Richards. These functions model cumulative trends. Their derivatives are also available for modeling the rate of progress of the phenomenon under study (growth, spread, progress of a disease, etc.). When InfoStat fits a single-regressor model it also provides a view of the fit, plotting the observed points (x, y) with the fitted function superimposed.
Example 26: The data represent the cumulative germination percentage between 2 and 14 days after sowing of seeds of a forage shrub subjected to mild water stress in four independent trials (data courtesy of Dr. Aiazzi, Faculty of Agricultural Sciences, UNC). The goal is to model the evolution of the cumulative germination percentage as a function of time. The data are in the Germination file.
Note: InfoStat requires a file with two columns, one containing the day of observation (variable X) and another with the germination values (variable Y).

A graph of the cumulative germination percentage versus time displays a curve with a sigmoid shape. Several models can be used to fit this type of behavior; when the Gompertz function was used, a good fit was found.
To do this, select Menu STATISTICS NON-LINEAR REGRESSION. In the Non-linear regression window assign "Germination" as the Dependent variable and "Day" as the Regressor. In the dialog box specific to the nonlinear regression analysis, select the Gompertz model under Non linear model with one predictor. Finally press the Go button and wait for the estimation stages to finish. The Output window will show the values of the fitted parameters. The following figure shows the observed cumulative germination versus observation day together with the graph of the fitted function; the corresponding output is displayed below.
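The following sketch shows how the same Gompertz model could be fitted outside InfoStat with SciPy's curve_fit; the day and germination arrays are placeholders (the actual values are in the Germination file), and the starting values are those InfoStat assigns for this example (see Table 45 below).

# Hedged sketch: fitting alfa*exp(-beta*exp(-gamma*day)) with SciPy's curve_fit.
# The response values below are simulated placeholders, not the Germination file.
import numpy as np
from scipy.optimize import curve_fit

def gompertz(day, alfa, beta, gamma):
    return alfa * np.exp(-beta * np.exp(-gamma * day))

day = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype=float)
germ = gompertz(day, 90.0, 13.0, 0.5) + np.random.default_rng(1).normal(0, 1.0, day.size)

p0 = [94.0, 12.89, 0.47]                     # starting values as in Table 45
est, cov = curve_fit(gompertz, day, germ, p0=p0)
se = np.sqrt(np.diag(cov))                   # asymptotic standard errors
print(dict(zip(["alfa", "beta", "gamma"], est)), se)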

[Figure: two panels showing (left) the observed values (cumulative) of germination versus observation day and (right) the observed values of germination (cumulative) with the fitted Gompertz model versus observation day.]
Figure 20: Nonlinear regression. File Germination.


Table 45: Nonlinear regression output. File Germination.
Non linear regression
Model: Germination = alfa*exp(-beta*exp(-gamma*Day))

Variable      N    MSError   Iteration
Germination   28    18.54        4

Parameters   LL PAR   UL PAR   Starting value   Estimate   S.E.     T       p-value
ALFA         -1E30     1E30        94.00          89.07    1.44    62.04    <0.0001
BETA         -1E30     1E30        12.89          32.84    9.44     3.48     0.0019
GAMMA        -1E30     1E30         0.47           0.68    0.05    12.54    <0.0001
InfoStat reports the expression of the fitted model and the estimates of each of its parameters. In this example all the terms of the model made a significant contribution.
The comparison of alternative nonlinear regression models is based on several criteria. In general it is desirable that the mean square error (MSError) be as small as possible, that the number of model parameters be as small as possible (principle of parsimony), that the standard errors of the estimated parameters be small, and that the estimated coefficients not be highly correlated. Finally, the scatter plot of residuals versus predicted values can also serve to assess model adequacy.
Note: The names of the variables used as regressors should not contain mathematical symbols, brackets or other special characters such as accents or %.

Correlation analysis
Under this heading are grouped the methods for calculating sample correlation coefficients, partial correlation coefficients, and the direct and indirect effects of a path analysis. All these methods assume that two or more random variables are recorded on each of the experimental or observational units. The interest is in obtaining a measure of the magnitude (and direction) of the association or covariation of each pair of variables.

Correlation coefficients
Menu STATISTICS CORRELATION ANALYSIS CORRELATION COEFFICIENTS. In the Correlation coefficients window, specify the variables for which you want to obtain the correlation coefficients. The next window lets you choose between the Pearson and Spearman correlation coefficients (Conover, 1999). The results are presented as a matrix with the following characteristics: 1) the number of rows equals the number of columns and equals the number of selected variables; 2) the main diagonal elements are all equal to 1 because they represent the correlation of a variable with itself; 3) below the main diagonal, in position j,i, is the correlation coefficient between the i-th and j-th variables of the list; 4) above the main diagonal, in position j,i, is the probability associated with the test of the hypothesis of null correlation between the j-th and i-th variables of the list.
The Pearson correlation coefficient is a measure of the magnitude of the linear association between two variables that does not depend on the measurement units of the original variables. For the j-th and k-th variables it is defined as:

$r_{jk} = \dfrac{S_{jk}}{\sqrt{S_j^2\,S_k^2}} = \dfrac{\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right)/(n-1)}{\sqrt{\left[\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^2/(n-1)\right]\left[\sum_{i=1}^{n}\left(x_{ik}-\bar{x}_k\right)^2/(n-1)\right]}}$
where $S_{jk}$ is the covariance between variables j and k, and $S_j^2$ and $S_k^2$ are the variances of variables j and k, respectively.
The sample correlation coefficient represents the covariance of the standardized sample values. It takes values in the range [-1, 1], and the sign indicates the direction of the association (negative values occur when the average trend indicates that if one value of an observed pair is larger than its mean, the other value is smaller than its mean).
The Spearman correlation coefficient is a nonparametric measure of association based on ranks, which can be used for discrete or continuous variables that are not necessarily normal. This coefficient can also be used to measure association between ordinal qualitative variables. For the j-th and k-th variables it is defined as:

$Sr_{jk} = \dfrac{\sum_{i=1}^{n} R\left(x_{ij}\right)R\left(x_{ik}\right) - n\left(\dfrac{n+1}{2}\right)^2}{\sqrt{\left[\sum_{i=1}^{n} R\left(x_{ij}\right)^2 - n\left(\dfrac{n+1}{2}\right)^2\right]\left[\sum_{i=1}^{n} R\left(x_{ik}\right)^2 - n\left(\dfrac{n+1}{2}\right)^2\right]}}$
where $R(x_{ij})$ is the rank corresponding to the i-th observation of variable j and $R(x_{ik})$ is the rank corresponding to the i-th observation of variable k, with i = 1, ..., n.
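As a quick cross-check of these definitions, both coefficients (and the p-values of the associated tests) can be computed with SciPy, as in the following hedged sketch with invented data:

# Hedged sketch: Pearson and Spearman correlation coefficients with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
xj = rng.normal(size=30)
xk = 0.6 * xj + rng.normal(scale=0.8, size=30)

r, p_r = stats.pearsonr(xj, xk)      # linear association
rs, p_rs = stats.spearmanr(xj, xk)   # rank-based association
print(f"Pearson r = {r:.2f} (p = {p_r:.4f}); Spearman r = {rs:.2f} (p = {p_rs:.4f})")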

Partial correlation coefficients


Menu CORRELATION ANALYSIS PARTIAL CORRELATION allows obtaining partial correlations between two or more variables after adjusting for the effects of one or more additional variables (fixed variables). The fixed variables can be either continuous or classification variables.
The partial correlation between two variables Y1 and Y2 adjusting for X can be interpreted as the correlation between the residuals of the regression of Y1 on X and the residuals of the regression of Y2 on X.
When invoking partial correlations in InfoStat, in the variable selector you must indicate the variables for which you want the partial correlation coefficients (Y variables) and the variables for which the correlations must be adjusted (fixed variables). On pressing Go, a matrix is obtained with the following characteristics: 1) the number of rows equals the number of columns and equals the number of selected Y variables; 2) the main diagonal elements are all equal to 1 because they represent the correlation of a variable with itself; 3) below the main diagonal, in position i,j, is the partial correlation coefficient between the i-th and j-th Y variables, adjusted for the fixed X variables indicated by the user.
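A minimal sketch of this interpretation, assuming two response variables and one continuous fixed variable, is the following (illustrative data; InfoStat additionally accepts classification variables as fixed variables):

# Hedged sketch: partial correlation of Y1 and Y2 given X, computed from the
# residuals of the regressions of Y1 on X and of Y2 on X.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y1 = 2.0 * x + rng.normal(size=50)
y2 = -1.5 * x + rng.normal(size=50)

def residuals(y, x):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

r_partial = np.corrcoef(residuals(y1, x), residuals(y2, x))[0, 1]
print(f"partial correlation of Y1 and Y2 given X: {r_partial:.2f}")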

Path analysis
Menu CORRELATION ANALYSIS PATH ANALYSIS allows decomposing the correlation between two variables (X and Y) into the sum of the direct effect of X on Y and the indirect effects of X on Y through the other independent variables of the system of correlations.
It is well known that correlations between two variables cannot be used to establish causal relationships. When one variable precedes another in time and/or the existence of a causal relationship can be postulated (and it is assumed that this relationship is linear), linear models can be used to express the relationship. The goal of path analysis, or structural equation analysis, is to provide causal explanations of the observed correlations between a response (dependent) variable and a set of predictor or causal variables (exogenous or independent variables). The notion of a causal relationship between the dependent variable and an independent variable requires that the effect of all the other independent or causal variables recognized in the system has been eliminated. In a linear regression model, $Y = \beta_0 + \beta_1 X + \varepsilon$, the last term (random error term) represents the collective effect of all the unmeasured variables that could influence the variables under study. The regression, written in standardized form, is:
$\dfrac{Y-\mu_Y}{\sigma_Y} = \beta_1\,\dfrac{\sigma_X}{\sigma_Y}\,\dfrac{X-\mu_X}{\sigma_X} + \dfrac{\sigma_\varepsilon}{\sigma_Y}\,\dfrac{\varepsilon}{\sigma_\varepsilon}$

which can also be expressed as:

$Z_Y = p_{yx} Z_X + p_{y\varepsilon} Z_\varepsilon$

The p parameters in the standardized model are known as path coefficients. The causal model implies that the correlation between X and Y is $p_{yx}$ and that the model is self-contained or completely determined, since the contributions of the two terms of the model to the variance of $Z_Y$ sum to 1: $Var(Z_Y) = p_{yx}^2 + p_{y\varepsilon}^2 = 1$.
Path analysis builds models of cause-and-effect relationships between variables through the partition of the correlation between two variables as the sum of two types of effects: the direct effects of one variable on another (simple paths) and the indirect effects of one variable on another through one or more exogenous variables (compound paths). If a new variable is considered in the previous system, say the variable U, and a linear relationship is assumed, it can be represented by the linear model $Y = \beta_0 + \beta_1 X + \beta_2 U + \varepsilon$; path analysis will provide information about the direct effects of X and U on Y (simple paths in the diagram of the system) and also about the indirect effects of X on Y through U and of U on Y through X. The indirect effect of a variable X on Y through another variable U is defined as $p_{yu}\, r_{x,u}$, where the p coefficients correspond to the standardized coefficients of the multiple regression of Y on X and U, and $r_{x,u}$ is the simple correlation coefficient between X and U. Then, the path analysis of this system involving two causal variables makes the following partition of the observed correlation between Y and X and of the correlation between Y and U (excluding error terms):

$r_{y,x} = p_{y,x} + p_{y,u}\, r_{x,u}$
$r_{y,u} = p_{y,x}\, r_{x,u} + p_{y,u}$

Given a sample, it is possible to obtain values for all the correlation coefficients involved in this system of equations; the number of unknowns is always equal to the number of equations. Solving the system gives the estimates of the direct effects.
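The following hedged sketch reproduces this partition for a system with two causal variables using invented data: the direct effects are the standardized regression coefficients and the indirect effect is the product p_yu·r_xu, so that direct plus indirect recovers the simple correlation.

# Hedged sketch: path decomposition with two causal variables (invented data).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=40)
u = 0.4 * x + rng.normal(size=40)
y = 1.0 * x - 0.8 * u + rng.normal(size=40)

zx, zu, zy = [(v - v.mean()) / v.std(ddof=1) for v in (x, u, y)]
Z = np.column_stack([zx, zu])
p_yx, p_yu = np.linalg.lstsq(Z, zy, rcond=None)[0]   # direct effects (path coefficients)
r_xu = np.corrcoef(x, u)[0, 1]

print("direct X->Y:", round(p_yx, 2), " indirect X->Y via U:", round(p_yu * r_xu, 2))
print("total r(Y,X):", round(p_yx + p_yu * r_xu, 2), "=", round(np.corrcoef(y, x)[0, 1], 2))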
In InfoStat the results of the path analysis are presented in tables showing all the direct and indirect effects of the system under study. The coefficients help to determine their relative importance. The conclusions of the analysis depend on the assumed linear relationship, so it is advisable to check that the correlation between the system output variable and the error term is low, which implies that there are no important causal factors that have not been incorporated into the model.
Example 27: An experiment on the growth of weeds used 20 experimental units consisting of trays with 40 seeds sown at the beginning of the experiment. The number of seeds germinated after a certain time was recorded, and an indicator of leaf area and the total biomass were obtained for each tray. We intend to study the correlations of biomass with leaf area and the number of germinated seeds in a system where biomass is considered the dependent variable. The data are in the Path file.
In the menu CORRELATION ANALYSIS PATH ANALYSIS, the Path coefficients window allows selecting the dependent variable and the system variables. Indicate "Biomass" as the Dependent variable and "SeedGerm" and "Leaf_area" as Independent variables. On pressing Go, the Output window displays the following information.
Table 46: Path analysis output. File Path.
Path analysis
Dependent variable: Biomass; n=20

Effect       Via         Coefficients   p-value
SeedGerm     Direct           0.78
SeedGerm     Leaf_area       -0.02
             Total r          0.76      0.0001
Leaf_area    Direct          -0.52
Leaf_area    SeedGerm         0.03
             Total r         -0.49      0.0272

The correlation between biomass and leaf area was significant (r=-0.49, p=0.0272) and is almost completely accounted for by the direct effect of leaf area (-0.52), since the indirect effect through seed germination is very small (0.03).

Correlation between distance matrices


This option evaluates the correspondence between two distance (or similarity) matrices. The simplest way is to consider the informative elements of the two distance matrices, i.e., the n(n-1)/2 distinct elements outside the diagonal, and obtain an estimate of the correlation between them, such as the Pearson linear correlation coefficient.
It is usual to accompany the correlation coefficient with a scatter diagram of the elements of both matrices in order to highlight pairs that depart from the pattern of correlation. A more appropriate test to assess the correlation between two matrices is the one based on the Mantel Z statistic, whose significance is obtained by permutation (Mantel, 1967), because the pairs of data from which the correlation is calculated are not really independent. The Mantel (1967) test takes into account the autocorrelations of the elements of a distance matrix. The test statistic is:
$Z = \sum_{i<j}^{n} x_{ij}\, y_{ij}$

where $x_{ij}$ and $y_{ij}$ are the elements i,j (elements outside the main diagonal) of the matrices X and Y, respectively. The statistic is the sum of the cross products of these elements, since this is the only quantity that is really sensitive to the permutations that will be undertaken to assess the significance of the association. If the two matrices show a relationship of similarity, then Z should be large in comparison with the value of Z that would be expected on average for uncorrelated matrices. The observed value of Z is positioned on the distribution of Z under the assumption that the matrices are not correlated, which is obtained by permutation, and the probability of Z values greater than or equal to the observed one is calculated. If this probability (p-value) is below the significance level of the test, e.g. alpha 0.05, then the hypothesis of no association is rejected and it is concluded that there is some correspondence between the two matrices.
The distribution of Z under the null hypothesis is obtained by comparing one matrix, say X, with all possible matrices Y in which the order of the objects (or variables) has been permuted. The procedure for the Mantel test applied to two matrices is therefore to calculate the statistic from the original matrices, permute the rows and columns of one matrix while the other remains constant, each time recalculating the quantity $Z^*_{XY}$, and compare these values with the original $Z_{XY}$ value.
Since the permutation of the order of the objects has no effect on the mean and variance of the matrix Y, it is important to note that the Pearson product-moment correlation coefficient is monotonically related to Z. So, upon detecting a significant correlation with the Mantel test, one may report the Pearson correlation coefficient as a measure of the magnitude of the association. The latter has the advantage of being expressed in more familiar units, due to the standardization, and is therefore easier to interpret. InfoStat reports the statistic on this scale.
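A minimal permutation version of the Mantel test, written in Python with invented distance matrices and reporting the statistic on the correlation scale, could look as follows; it is a sketch of the procedure described above, not InfoStat's implementation:

# Hedged sketch: Mantel test by permutation of the objects of one matrix.
import numpy as np

rng = np.random.default_rng(3)
pts = rng.normal(size=(12, 2))
D1 = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)              # distance matrix 1
D2 = D1 + rng.normal(scale=0.3, size=D1.shape); D2 = (D2 + D2.T) / 2   # a related matrix

iu = np.triu_indices_from(D1, k=1)             # the n(n-1)/2 informative cells
r_obs = np.corrcoef(D1[iu], D2[iu])[0, 1]

n_perm, count = 999, 0
for _ in range(n_perm):
    perm = rng.permutation(D1.shape[0])
    r_perm = np.corrcoef(D1[np.ix_(perm, perm)][iu], D2[iu])[0, 1]
    count += r_perm >= r_obs
print(f"Mantel r = {r_obs:.2f}, permutation p = {(count + 1) / (n_perm + 1):.3f}")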

Categorical data analysis


Contingency tables
Menu STATISTICS CATEGORICAL DATA CONTINGENCY TABLES allows building cross-classification tables according to different classification criteria. Agresti (1990) presents an excellent treatment of categorical data analysis that extensively covers the modeling and analysis of contingency tables. Some of the terminology commonly used in the analysis of these tables is presented below.
Contingency tables (tabular forms for reporting categorized data) are useful for the simultaneous analysis of two or more categorized variables. A categorized variable is one whose measurement scale consists of a set of categories; for example, the variable type of home can be categorized into the two categories "rural" and "urban". To properly analyze and interpret contingency tables it is necessary to take into account the measurement scale of the variables involved and the type of study (randomization) used to obtain the data. Typically, the hypotheses of interest in contingency tables refer to the association between the variables that define the rows and columns of the table.
The categorized variables with levels that have no natural ordering are called nominal (e.g.,
political affiliation with categories "liberal" and "conservative"). A particular case is that of
the binary variables which involve 2 categories of nominal variables, such as "yes" and
"no", "answer" and "no answer."
If the levels are ordered the variable is called ordinal, for example, infection rate
categorized as "mild," "moderate" and "severe." Although the categories can be sorted,
unlike the quantitative variables absolute distances between categories are unknown. In
some situations, the tables can be constructed with variables measured on an interval scale,
this scale implies known numerical distance between any two levels of scale (e.g., intervals
of the age variable).
The variables that constitute the table can be considered as response variables or as classification variables. The former, also called dependent variables, are random and describe what was observed in the sampling units. The latter, also called independent variables or factors, are fixed by conditioning, and the combinations of their levels define strata, populations or subpopulations to which the sampling units belong. When all the variables in the table are response variables, the association between them is usually analyzed. When some are responses and others are classification variables, the effects of the classification variables on the distribution of the response variables are generally studied. If we denote by X a categorized variable with I categories or levels and by Y another variable with J levels, there are IJ classification combinations for classifying subjects on both variables.
The pairs (X, Y) associated with each subject randomly selected from a population have a probability distribution. The distribution is presented in a table of I rows and J columns. The probability associated with cell ij, in general denoted by $\pi_{ij}$, represents the probability that the variable X takes category i and the variable Y takes category j. The set of values $\pi_{ij}$ forms the joint distribution of both variables. The set of values $\pi_{i+}$ (total probability of row i), for i = 1, ..., I, forms the marginal distribution of the rows of the table. Equivalently, the marginal distribution of the columns can be obtained. When one variable (say Y) is considered the response variable and the other (say X) an explanatory variable, it is informative to identify the probability distributions of the response for each level of X, i.e., the conditional distribution of Y given X.
The notion of independence is commonly used in contingency tables. Two variables (X and Y) are statistically independent if the conditional distributions of Y are identical for all levels of X. When both variables are considered as response variables it is equivalent to examine the conditional distribution of Y given X or the conditional distribution of X given Y. Statistical independence expresses the joint probabilities (probability of cell ij) as the product of the marginal probabilities, that is, the probability of row i times the probability of column j (expected value under independence).


Contingency tables can be used to display results from different types of studies: 1) experimental studies, in which the researcher has control over the group of subjects, i.e., decides under which condition each subject will be observed; these studies are prospective and in the biomedical field are known as clinical trials; 2) observational studies, which may be retrospective (case-control) or prospective (cohort, cross-sectional). In case-control studies the past is investigated for an arbitrarily selected group of individuals that have the characteristic under study (cases) and for another group of subjects who do not have it, used as a reference (controls). This arbitrary choice precludes certain inferences about Y: the marginal distribution of Y is determined by the sampling and does not necessarily correspond to the characteristics of the population. In cohort or cross-sectional studies one starts from a random sample of subjects, each of which is classified into one of the IJ cells of the table, simultaneously, as appropriate. The marginal totals are therefore random (not fixed by the experimenter).
Thus, the study design involves a particular type of sampling which must be taken into account when interpreting the statistics derived from the contingency table. Typically, for a 2x2 table (I=2, J=2), the following sampling schemes are identified: 1) Poisson sampling: each cell is an independent Poisson variable; it derives from cross-sectional studies where sampling is random and the total number of individuals (n) is not fixed; 2) binomial sampling: each row of the table defines a different group and the sample sizes of the rows are fixed by design (conditioning); it is often necessary to analyze the distributions conditional on the rows, which are modeled as binomial (in tables with J>2 the multinomial model is used for each row); 3) multinomial sampling: the cell counts are multinomial, the total sample size is fixed but not the row or column totals; 4) with n and the marginal totals fixed, the distribution of the cell counts can be approximated by a hypergeometric distribution.
Example 28: The following table corresponds to the classification of company employees,
according to the branch to which they belong and their views on the opportunities for
advancement in their jobs. The data is in the Categorized file.
Table 47: Number of employees according to their opinion about promotion opportunities in three branches. File Categorized.

                 Promotion opportunities
Branch     Low    Moderate    High    Total
A          205      174        138     517
B          199      184        118     501
C          152      167        227     546
Total      556      525        483    1564

Menu STATISTICS CATEGORICAL DATA CONTINGENCY TABLES allows identifying, in the Contingency table window, the variables to be used to form the row and column classification criteria. To declare the variables of Example 28, indicate the columns "Branch" and "Promotion" as Class variables. The variable "Frequency" must be entered in the Frequencies subwindow. After pressing OK, in the Select rows and columns tab indicate that "Branch" defines the rows and "Promotion" the columns of the table.
Table 48: Contingency tables output. File Categorized.
Contingency table
Frequencies: Frequency

Absolute frequency
In columns: Promotion
Branch    High    Low    Moderate    Total
A          138    205       174       517
B          118    199       184       501
C          227    152       167       546
Total      483    556       525      1564

Statistic                        Value    df       p
Chi-square (Pearson)             48.84     4    <0.0001
Chi-square (ML-G2)               48.33     4    <0.0001
Contingency Coef. (Cramer)..      0.10
Contingency Coef. (Pearson)..     0.17

Since the p-value is <0.0001, it is concluded that there is a relationship between the branch and the opinion given.
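For reference, the same Pearson Chi-square test can be reproduced outside InfoStat from the counts of Table 47; the following sketch uses SciPy and should give values close to those of Table 48:

# Hedged sketch: chi-square test of independence for the Branch x Promotion table.
import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([[205, 174, 138],    # branch A: Low, Moderate, High
                   [199, 184, 118],    # branch B
                   [152, 167, 227]])   # branch C

chi2, p, df, expected = chi2_contingency(counts, correction=False)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.2g}")

# Pearson contingency coefficient, PCC = (chi2 / (chi2 + n))^(1/2)
n = counts.sum()
print("PCC =", round((chi2 / (chi2 + n)) ** 0.5, 2))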
InfoStat allows analyzing three-way classification tables. Optionally, you can specify a column of the file as a stratification criterion for the data. Suppose the same survey was conducted in two different provinces; then there will be two tables like the one above, each constructed from the data of a particular province. The province variable can be indicated as a stratification criterion for the complete data set. In this case InfoStat works with a set of contingency tables (one table per stratum), providing the classical statistics for each stratum and for the marginal table (the summary table across all strata). If more than one variable is assigned in the Strata (optional) window, the strata will be defined by the crossing of these variables. The hypothesis of association in both types of analyses (with and without strata) explores the association between the branches (treatments) and the opinion regarding the possibility of promotion (response); only when there are strata is this association assessed after controlling or adjusting for the effect of the stratum.

Organizing data
Two different kinds of formats can be used to build contingency tables:
Expanded format: each case contains registered values for the variables X and Y for each
unit of study or subject. For example if you have as classification variables "sex" (2 levels)
and "smoking" (2 levels) and recorded 45 data for each sex, there will be a total of 90 data,
the information recorded for each individual will form a case. The file will have two
columns ("sex" and "smoke") and each of the 90 rows "sex" will take the value F or M and
"smoke" the value Yes or No as has been the response of the individual concerned.

Format using frequencies: if you have the counts for each table cell you can use a format of three columns, one identifying the levels of the row variable, another with the levels of the column variable and a third column with the count for each combination of row and column levels. For the above example, the table has a total of 4 cells; using the frequencies of each cell, the data would be entered as follows: the variables (columns of the table) "Sex" and "Smoke" are assigned as Class variables and the column of counts (see also the Contingency file) in Frequencies (optional - only one). If the relationship between sex and smoking is to be studied in 3 different populations, a fourth column identifying the source (population) of each count may be assigned as Strata.
The second window contains two tabs: Select rows and columns and Options. The first lists the file variables selected as classification variables so that the user can specify which variables should be used to build the table rows and which the columns. You can specify one or more variables for rows and for columns.
If more than one variable is assigned to rows (columns), the table will have as many rows (columns) as level combinations of the selected variables. For example, if one variable is "sex" and the other is "blood group" and both variables are assigned to rows, then the table will have at most 8 rows (2 sexes x 4 blood groups). In the same window, if you select the option All pair-wise tables, InfoStat ignores the level combinations of the variables declared in rows and columns and calculates the contingency tables for all pairs of variables listed in rows and columns. If no variables are assigned to rows, the table will have only one row to contain the information about the columns, and vice versa. If you keep the option Presentation in alphabetic order (enabled by default), the contingency table is constructed with the values of the row and column variables in alphanumeric order. Turning off this option, the tables are produced with the order of rows and columns corresponding to the way the data were entered in the file. This option should be considered carefully when working with ordinal response variables whose alphanumeric order does not correspond to their natural order; for example, the variable level of infection categorized as low, moderate and high, where the alphabetical order (high, low, moderate) does not correspond to the ordinal categories.
In the Options tab you can choose the information you want displayed in the contingency table and the statistics reported. You can obtain tables containing: Absolute frequency, Relative frequency by rows, Relative frequency by columns, and Relative frequency (total). You can also request: Expected frequency under hypothesis of independence, Residuals under hypothesis of independence, Standardized residuals under hypothesis of independence (constructed from the residuals table divided by the square root of the expected value), and Adjusted residuals. By default the relative frequencies are reported as values in the interval [0,1]; if you want to display this information as percentages you should activate the option Relative frequencies as percentages.
For tables of any size you can apply hypothesis tests based on the approximate Chi-square distribution by activating the option Chi square. In this case, InfoStat reports the values of the Pearson Chi-square statistic, the maximum likelihood Chi-square or G2 (Chi square G2-MV), the Cramer contingency coefficient, the Pearson contingency coefficient and the p-values of the respective hypothesis tests. All these statistics measure general types of association.
In IxJ tables with multinomial sampling, the hypothesis of statistical independence implies that the expected frequency of cell ij corresponds to the product of the marginal frequencies of row i and column j. To test this hypothesis InfoStat provides the Pearson Chi-square statistic, whose expression is:

$\chi^2 = \sum_i \sum_j \dfrac{\left(n_{ij} - \hat{m}_{ij}\right)^2}{\hat{m}_{ij}}$

where $n_{ij}$ represents the sample count of cell ij and $\hat{m}_{ij}$ is the estimate of the expected absolute frequency, obtained as $\hat{m}_{ij} = n\,p_{i.}\,p_{.j}$, with $p_{i.}$ and $p_{.j}$ the relative frequencies of row i and column j, respectively. Under the null hypothesis the statistic is distributed as a Chi-square with (I-1)(J-1) degrees of freedom. High p-values (higher than the nominal significance level of the test) imply that there is not enough sample evidence to reject the hypothesis of independence between the row variable and the column variable. If the p-value leads to rejection of the null hypothesis of independence, it is concluded that there is an association between the two variables.
A Chi-square statistic with v degrees of freedom can be partitioned into v independent Chi-square components with one degree of freedom each. This result allows, for example, amalgamating columns of the table to test independence with respect to another column with a Chi-square statistic with (I-1) degrees of freedom.
The sample size requirement for the Pearson Chi-square statistic states that all expected values under the hypothesis of independence should be greater than or equal to 5. The application of this criterion to other association statistics that take advantage of the nature of the data can be very conservative (Agresti, 1990).
InfoStat also provides the test based on the maximum likelihood ratio (Chi square ML-G2) for the scenario discussed above. The statistic, denoted G2, is:

$G^2 = 2 \sum_{ij} n_{ij} \log\left(n_{ij}/\hat{m}_{ij}\right)$

Under the null hypothesis this statistic is also distributed, for large samples, as a Chi-square with (I-1)(J-1) degrees of freedom; thus the Pearson Chi-square and G2 are asymptotically equivalent. The asymptotic results obtained assuming multinomial sampling also hold for other types of sampling (Agresti, 1990). Both statistics are invariant to permutations of the order of rows and columns (they treat the categorized variables as nominal).
Other measures of association provided by InfoStat are coefficients based on the Chi-square statistic (index values that summarize the association). These are the Cramer contingency coefficient and the Pearson contingency coefficient, calculated as follows:

Cramer contingency coefficient: $V^2 = \chi^2 / \left(n\,\min(I-1,\,J-1)\right)$

Pearson contingency coefficient: $PCC = \left(\chi^2 / (\chi^2 + n)\right)^{1/2}$
The values of both coefficients are between 0 and 1; values near zero imply independence of the row and column values of the table. In 2x2 tables, in addition to the above statistics, you can request the Fisher exact test (Irwin-Fisher) and the following measures of association: odds ratios, relative risk and the Phi coefficient. For the particular case of dependent (paired) samples, the McNemar test can be requested.
Fisher's exact test allows testing the hypothesis of independence without having to work with asymptotic approximations. The exact distribution of the statistic under the null hypothesis is based on the distribution of the counts conditional on the marginal frequencies. The distribution of the observed frequencies is the hypergeometric distribution and does not depend on any unknown parameter. The exact distribution can be expressed in terms of the count of cell 1,1 (n11). The range of possible values of n11 with fixed marginals is known; it is therefore possible to construct all the possible tables under the hypothesis of independence and calculate exactly, from that list of tables, the probability of obtaining values of the statistic greater than the observed one. InfoStat provides exact p-values for two-tailed and one-tailed (right and left) tests of independence in 2x2 tables.
Odds ratios are frequently used measures of association. The odds ratio in a 2x2 table is defined as the ratio of the products of the counts on the diagonals of the table, which is why it is also called the cross-product ratio. The sample estimate of the odds ratio is

$\hat{\theta} = \dfrac{n_{11}\, n_{22}}{n_{12}\, n_{21}}$

Sample odds ratios cannot be defined in cases where the two entries of a row or column are zero. As this event has positive probability, the expected value and variance of the estimator do not exist (Agresti, 1990). Using the logarithmic transformation, and expressing the estimate through a multiplicative rather than additive structure, it can be shown for the sampling types described that the odds ratio estimator is asymptotically normally distributed and that its asymptotic standard error is:

$\hat{\sigma}\left(\log \hat{\theta}\right) = \left(\dfrac{1}{n_{11}} + \dfrac{1}{n_{12}} + \dfrac{1}{n_{21}} + \dfrac{1}{n_{22}}\right)^{1/2}$



InfoStat uses this expression replacing nij by (nij + 0.5), since this replacement improves the estimator (Agresti, 1990). It also provides, using the normal approximation, a confidence interval (with 95% confidence coefficient) for the odds ratio calculated using group 1 in the numerator (1/2) and for its reciprocal, i.e., the odds of success in group 2 relative to group 1 (2/1).
For example, an odds ratio of 3 with a confidence interval that does not include 1 implies a significant association between the row variable and the column variable. The value 3 means that the cases in row 1 have triple the odds of registering a "success" in the response variable compared with those in group 2; that is to say, the cases in group 2 have one third of the odds of recording a "success" in the response variable compared with the cases in group 1.
To compare the conditional distributions of a column of the response variable within each row, the statistic known as relative risk is provided. If the counts in each row of the table are distributed as binomial (the two rows being independent binomials), the difference of proportions in a column between the two rows can be evaluated through simple confidence intervals based on the normal approximation (see Difference of proportions, Statistics menu, Two-sample inference submenu). The sample relative risk (RR) is obtained as:

$RR = \dfrac{\hat{p}_{1/1}}{\hat{p}_{1/2}}$

where $\hat{p}_{i/j}$ is the estimate of the probability of obtaining response i given that the case belongs to row j. The relative risk is always greater than or equal to zero; values equal to 1 suggest that the probabilities of the response in question are the same for both rows. The RR can take high values when both probabilities are close to 0 or 1, whereas in such cases the difference of proportions would be very small. For example, if the probability of success in row 1 is 0.01 and in row 2 it is 0.001, the difference of proportions is 0.009 while the RR is 10.
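The following sketch computes the odds ratio (with the 0.5 correction in the standard error of its logarithm) and the relative risk for a generic 2x2 table with invented counts; it only illustrates the formulas above:

# Hedged sketch: odds ratio with approximate 95% CI, and relative risk (invented counts).
import numpy as np

n11, n12, n21, n22 = 30, 70, 15, 85           # row 1: successes/failures; row 2: idem

odds_ratio = (n11 * n22) / (n12 * n21)
c = np.array([n11, n12, n21, n22]) + 0.5      # 0.5 added to each count (Agresti, 1990)
se_log_or = np.sqrt((1 / c).sum())
lo, hi = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)
print(f"OR = {odds_ratio:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")

p1 = n11 / (n11 + n12)                        # probability of success in row 1
p2 = n21 / (n21 + n22)                        # probability of success in row 2
print(f"RR = {p1 / p2:.2f}")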
The Phi coefficient also measures association; it is based on the Chi-square statistic and is calculated only for 2x2 tables (Conover, 1999). The Cramer contingency coefficient in 2x2 tables corresponds to the measure of association known as Phi squared, which is also equivalent to the Goodman and Kruskal tau (Agresti, 1990) (a measure of proportional reduction in variation).
The McNemar test is used to evaluate the significance of changes in the categories of bivariate responses (paired observations for each individual). It is assumed that there are n independent pairs of dichotomous random variables (Xi, Yi). The possible values of the i-th pair are then (0,0), (0,1), (1,0) and (1,1), which are usually arranged in the cells of a 2x2 contingency table. McNemar's statistic has the form:

$T_1 = \dfrac{(b-c)^2}{b+c}$


where a, b, c and d are the numbers of observations of the sample falling within the cells (0,0), (0,1), (1,0) and (1,1), respectively. The statistic does not depend on a and d, since these represent the number of cases where both variables assume the same value; they are discarded from the analysis because the aim is to study changes between the categories of the variables. As an example, Conover (1999) presents a situation where a survey on the intention to vote for two political parties is conducted before and after a televised debate between the candidates of both parties. The interest is in evaluating whether there was a significant change in voting intention after the debate. If the zero value of X represents the intention to vote for candidate A before the debate and the zero value of Y the intention to vote for candidate A after the debate, the hypotheses tested are:

H0: P(Xi=0, Yi=1) = P(Xi=1, Yi=0)
H1: P(Xi=0, Yi=1) ≠ P(Xi=1, Yi=0)   for i=1, ..., n

InfoStat uses the asymptotic approximation of the distribution of T1, which is a Chi-square with one degree of freedom, when b+c is large; when b+c is less than 200, the probability values are derived from the exact distribution of the statistic T2=b, whose distribution is Binomial with parameters (b+c, 0.5).
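A hedged sketch of both versions of the test (asymptotic and exact), using invented counts of discordant pairs b and c, is shown below:

# Hedged sketch: McNemar test, asymptotic chi-square version and exact binomial version.
from scipy import stats

b, c = 15, 5                                  # discordant pairs (0,1) and (1,0)

t1 = (b - c) ** 2 / (b + c)
p_asymptotic = stats.chi2.sf(t1, df=1)
p_exact = stats.binomtest(b, n=b + c, p=0.5).pvalue   # two-sided exact test
print(f"T1 = {t1:.2f}, asymptotic p = {p_asymptotic:.4f}, exact p = {p_exact:.4f}")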

Three-way contingency tables


In studies of association between a response variable Y and an explanatory variable X where there are one or more control variables (Z) that define strata, the stratification information must be used to study the association between X and Y without confounding due to the presence of strata. One way to eliminate the effect of Z is to work with the associations between X and Y for each level of Z (partial tables by stratum). Another is to work with all the data but discounting the effect of Z.
InfoStat allows generalizing the analysis of two-way tables to tables of three or more ways through: 1) the construction of all the partial tables, i.e., those corresponding to each level of Z (to do this select Show partial tables), which effectively removes the effect of Z; 2) the construction of the marginal table, i.e., the one in which each cell shows the sum of the counts of the corresponding partial-table cells (summing over Z), which ignores the effect of Z; 3) the Cochran-Mantel-Haenszel test (Agresti, 1990) for the association between X and Y controlling for Z and, for the case of 2x2 tables, the Mantel-Haenszel odds ratio. These tests and related statistics assess the hypothesis on the relationship between X and Y using all the information in the tables generated by the levels of Z. The statistic of the Cochran-Mantel-Haenszel test (CMH), which is distributed as a Chi-square with one degree of freedom, combines the information from all the partial tables and is given by:

$CMH = \dfrac{\left[\sum_k \left(n_{11k} - \mu_{11k}\right)\right]^2}{\sum_k Var\left(n_{11k}\right)}$

where the summation over k is across all the partial tables, $n_{11k}$ is the count in cell 1,1 of the k-th table, and $\mu_{11k}$ and $Var(n_{11k})$ are the expectation and the variance of the count in cell 1,1, respectively. The Mantel-Haenszel odds ratio (controlling for Z) is given by:

$\hat{\theta}_{MH} = \dfrac{\sum_k \left(n_{11k}\, n_{22k} / n_{++k}\right)}{\sum_k \left(n_{12k}\, n_{21k} / n_{++k}\right)}$

where the summation over k is across all the partial tables, $n_{ijk}$ is the count in cell i,j of the k-th table, and $n_{++k}$ is the total number of observations in that table.
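The following sketch computes the CMH statistic and the Mantel-Haenszel odds ratio directly from these formulas for a small set of invented 2x2 strata (it does not use the University file of Example 29):

# Hedged sketch: Cochran-Mantel-Haenszel statistic and Mantel-Haenszel odds ratio.
import numpy as np
from scipy.stats import chi2

strata = [np.array([[20, 10], [15, 25]]),     # one 2x2 table per level of Z (invented)
          np.array([[18, 12], [10, 30]])]

num = den = or_num = or_den = 0.0
for t in strata:
    n = t.sum()
    r1, c1 = t[0].sum(), t[:, 0].sum()
    mu11 = r1 * c1 / n                                            # expected count of cell 1,1
    var11 = r1 * (n - r1) * c1 * (n - c1) / (n ** 2 * (n - 1))    # its variance
    num += t[0, 0] - mu11
    den += var11
    or_num += t[0, 0] * t[1, 1] / n
    or_den += t[0, 1] * t[1, 0] / n

cmh = num ** 2 / den
print(f"CMH = {cmh:.2f}, p = {chi2.sf(cmh, df=1):.4f}")
print(f"MH odds ratio = {or_num / or_den:.2f}")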
Example 29: A cross-sectional study involved 676 students; the condition of passing (1) or not passing (2) the entrance exam ("Exam") was recorded, together with the type of preparation for the exam: the student prepared alone (2) or in an academy (1) ("Preparation"). As the students came from two schools, the aim is to study the relationship between passing and type of preparation controlling for the academic unit of origin ("Faculty"). The data are in the University file.
Menu STATISTICS CATEGORICAL DATA CONTINGENCY TABLES allows obtaining the 2x2 contingency table associated with this problem and the corresponding association statistics. "Exam" and "Preparation" were selected as Class variables, "Faculty" as Strata and "Count" as Frequencies. In the next window, "Exam" was assigned to the columns and "Preparation" to the rows, which gives the output in Table 49.
Table 49: Three-way contingency table. File University.
Contingency table
Frequencies: Count
Stratification criteria: Faculty

Marginal table
Absolute frequency
In columns: Exam
Preparation     1      2    Total
1              54     16       70
2             434    172      606
Total         488    188      676

Statistics for the marginal table
Statistic                        Value    df       p
Chi-square (Pearson)              0.95     1    0.3286
Chi-square (ML-G2)                0.99     1    0.3200
Irwin-Fisher (two tails)          0.06          0.3982
Contingency Coef. (Cramer)..      0.03
Contingency Coef. (Pearson)..     0.04
Phi-Coefficient                   0.04

Odds ratio
Statistic           Estim.   LL 95%   UL 95%
Odds Ratio 1/2       1.34     0.75     2.38
Odds Ratio 2/1       0.75     0.42     1.33

Statistics adjusted by stratum effect
Cochran-Mantel-Haenszel test
Statistic   df       p
   4.29      1    0.0383

Odds ratio (Mantel-Haenszel)
Statistic             Estim.   LL 95%   UL 95%
MH Odds Ratio (1/2)    0.49     0.32     0.75
MH Odds Ratio (2/1)    2.04     1.33     3.12

In this example, from the marginal table (without controlling for Faculty), no significant association was detected between passing the entrance exam and whether the student prepared alone or in an academy: the p-value of the Pearson Chi-square statistic is 0.3286, higher than the significance level α=0.05, so there is no evidence to reject the null hypothesis of independence between passing the entrance exam and the type of preparation undertaken by the student. The confidence interval of the marginal odds ratio for the table includes the value 1, so one must conclude that the odds of passing the exam when the student prepared in an academy are the same as when the student prepared alone.
Association measures obtained from marginal tables can lead to false interpretations if the variable Z has an important effect on X, Y or both. In this example, controlling for Faculty (Z), the Cochran-Mantel-Haenszel test suggests the existence of a significant association (p=0.0383), and the Mantel-Haenszel odds ratio (MH Odds Ratio 1/2 = 0.49, with a confidence interval that excludes 1) indicates that, once the Faculty is taken into account, the odds of passing the exam differ between the two types of preparation by a factor of about two (equivalently, MH Odds Ratio 2/1 = 2.04).
The associations obtained through the partial tables (not shown in the output above) are known as conditional associations because they study the association between X and Y conditional on a fixed value of Z. In this example, for Faculty 1 the association exists, but for Faculty 2 it is not significant.

Logistic regression
Menu STATISTICS CATEGORICAL DATA LOGISTIC REGRESSION allows modeling the relationship between a dichotomous response variable and one or more independent variables or regressors. From the coefficients of the linear combination that models this relationship, the odds ratio for each regressor variable can be estimated.


The logistic regression model can be used to predict the probability (pi) that the response variable takes a particular value, for example, the probability of success (y=1) for a dichotomous variable that takes the values 0 and 1.
For a binary response, the simple logistic regression model, i.e. with one regressor, is as follows:

$Logit(p_i) = \log\left(p_i/(1-p_i)\right) = \alpha + \beta X_i$

where $p_i$ is the probability of success given $X_i$, α is the intercept (constant), β is the slope or regression coefficient associated with X, and X is the explanatory variable. Thus, logistic regression models the Logit transformation of the probability of success as a linear function of one or more explanatory variables.
The logistic model can be seen in the context of a more general class of models that establishes a linear model for g(μ), where μ is the expected value of the response variable and g is a function known as the link function. In logistic regression the canonical link corresponds to the logit function of the expected value of the random variable, which has a binomial distribution (Hosmer and Lemeshow, 1989; Seber and Wild, 1989).
If η denotes the linear predictor, in the simple regression case $\eta = \alpha + \beta X_i$; the probability of success is estimated by:

$\hat{p} = \dfrac{e^{\eta}}{1 + e^{\eta}}$

InfoStat allows fitting logistic regression models with a binary response variable and one or more regressors, which may be continuous or categorical. If a categorical variable has more than two categories, you should use the dummy variable generator (see Data Management), because InfoStat assumes that categorical regressors are binary. InfoStat calculates, for each of the variables in the model, the regression coefficient, its standard error, the estimated odds ratio, a confidence interval, -2(L0-L1) and the p-value for the hypothesis test H0: βi=0 versus βi≠0.
It also presents the log-likelihood of the chosen model (L). The column -2(L0-L1) contains -2 times the difference in log-likelihood between the reduced model (L0) and the complete model (L1). The reduced model for the i-th row of the table is the model specified by the user without the regressor corresponding to row i. Therefore, this column is the maximum likelihood ratio test for the hypothesis of a null regression coefficient for the variable in that row. The test of H0: βi=0 is based on a Chi-square statistic with one degree of freedom. InfoStat can also save the residuals and the predicted values for assessing the fitted model.
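As an external illustration of the model described above, the following sketch fits a binary logistic regression with the statsmodels package on invented data and converts the estimated coefficients into odds ratios:

# Hedged sketch: binary logistic regression with statsmodels (invented data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
age = rng.uniform(40, 80, 100)
ppi = rng.uniform(0, 20, 100)
eta = -4.0 + 0.02 * age + 0.15 * ppi                 # linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))          # binary response

X = sm.add_constant(np.column_stack([age, ppi]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)                                    # alpha and betas
print(np.exp(fit.params[1:]))                        # odds ratios for the regressors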
Example 30: The following example examines the effect of age, initial weight loss as % of normal weight (PPI), gender (1: male, 0: female) and a measure of the overall condition of the patient (PSPI) on survival in lung cancer patients evaluated three months after starting treatment (data courtesy of Dr. Norma Pilnik, Hospital Tránsito Cáceres de Allende, Córdoba). The data are in the Logistics file.

A multiple logistic regression model is fitted for the variable "Dead" (0 if the patient is alive and 1 if the patient died) in relation to age, gender, Initial_weight% and General_condition. The variables involved in the analysis are declared in the Logistic regression window: the variable "Dead" is included as the Dependent variable and "Age", "Gender", "Initial_weight%" and "General_condition" as Regressors.
Table 50: Logistic regression analysis. File Logistics.
Logistic regression

Parameters       Est.    S.E.   O.R.      Wald LL(95%)   Wald UL(95%)   -2(L0-L1)   p-value
constant        -7.59    2.71   5.1E-04      2.5E-06         0.10         11.80     0.0006
Age              0.03    0.04   1.03         0.96            1.11          0.74     0.3905
PPI              0.12    0.05   1.12         1.02            1.24          5.46     0.0194
SEX             -0.37    0.86   0.69         0.13            3.75          0.17     0.6788
PSPI             0.88    0.52   2.41         0.87            6.67          3.11     0.0776
Log-Likelihood                                                           -36.57       ___

According to the previous results, the initial weight loss (Initial_weight%) is the only variable showing a significant relationship with patient survival. General_condition is suggested as a possible predictor at a significance level of α = 0.10.
If the response variable is continuous and you want to convert it into a binary variable, in the dialog box that appears after the variable selector you can specify a threshold value to transform the response into a binary variable. At the bottom of the window, where it says Values over the mean are considered success, a field is enabled to enter the threshold (numeric value) above which the response is deemed a success; for successes InfoStat sets y=1 and for failures y=0. In the same window you can specify whether to save the residuals and/or the values predicted by the model.

Kaplan-Meier survival analysis


Menu STATISTICS CATEGORICAL DATA KAPLAN-MEIER SURVIVAL CURVES produces a graph of the survival curves using the Kaplan-Meier algorithm and calculates the standard errors displayed in the survival table according to the description given by Altman (1991).
The analysis of Kaplan-Meier curves allows studying the survival of entities in terms of a dichotomous variable (alive or dead). InfoStat also calculates the statistic called Log Rank to test the equality of k>=2 survival curves. A high value of Log Rank corresponds to a small p-value; if it is less than some pre-specified significance level, for example 5%, it indicates that at least one of the k survival curves compared is different.
The data table must have at least two columns, one indicating the survival time and the other indicating the variable that refers to the state of the individual, which may be: 0: alive, 1: dead from the specific cause of interest, 2: dead from other causes that are not of interest, abandonment of the protocol or loss of the subject. Data files where this variable is dichotomous (0: alive and 1: dead) can also be processed. If there are two or more groups (say k) of subjects, a third column acting as a classification factor with k levels must be indicated.
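Outside InfoStat, an analogous analysis can be sketched with the lifelines package, assuming the same layout (time, censoring code, group); the data below are invented:

# Hedged sketch: Kaplan-Meier curve and log-rank test with lifelines (invented data).
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(5)
time_a = rng.exponential(30, 25); dead_a = rng.integers(0, 2, 25)   # group NT
time_b = rng.exponential(60, 25); dead_b = rng.integers(0, 2, 25)   # group T

kmf = KaplanMeierFitter()
kmf.fit(time_a, event_observed=dead_a, label="NT")
print(kmf.median_survival_time_)                     # median survival of group NT

res = logrank_test(time_a, time_b, event_observed_A=dead_a, event_observed_B=dead_b)
print(f"log-rank p-value = {res.p_value:.4f}")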
Example 31: The following example studies 56 patients with multiple myeloma for whom survival (in months) was analyzed under two conditions: bone marrow transplant patients (T) and non-transplanted patients (NT). Their status as alive or dead was also recorded (censoring code: 0: alive and 1: dead). Data courtesy of Dr. Veronica Ortiz Corbella, Allende Hospital, Córdoba; they are in the Survival file.
The menu STATISTICS CATEGORICAL DATA KAPLAN-MEIER SURVIVAL CURVES was invoked; in the Kaplan-Meier survival curves window, "Survival (months)" was designated as Survival time, "Censure_code" as the Censoring code and "Transplant" as the Classification variable (optional). The results were:

[Figure: Kaplan-Meier survival curves by group; Y axis: Survival, X axis: Survival time (months); the annotation marks 27 months.]

The value p=0.001345 is less than the significance level α=0.05 required by the researcher for the test; this indicates that the difference in survival between the groups (NT and T) is significant. The non-transplanted group (NT) has a survival curve that decreases faster than that of the transplanted group (T). The chart shows that 50% of the non-transplanted patients live up to 27 months.

Figure 21: Kaplan-Meier survival analysis. File Survival.

Sensitivity-specificity curves
Menu STATISTICS CATEGORICAL DATA SENSITIVITY-SPECIFICITY CURVES allows obtaining empirical ROC curves, graphs of sensitivity and specificity (separate or simultaneous) and positive and/or negative predictive value graphs. The calculations performed by InfoStat are defined using the following table:
                         Condition (+)      Condition (-)
Forecast variable (+)    True positives     False positives
Forecast variable (-)    False negatives    True negatives

Note: Condition refers to the true state of the entity being classified, which can be positive or negative. The user can enter this condition directly or define it in terms of a value (median, mean or another value) of a selected variable. Forecast variable refers to a variable whose values (greater than or less than a cutoff) are used to predict the condition.

Sensitivity = [True positives / (True positives + False negatives)] × 100
Specificity = [True negatives / (True negatives + False positives)] × 100
Positive predictive value = [True positives / (True positives + False positives)] × 100
Negative predictive value = [True negatives / (True negatives + False negatives)] × 100

ROC curves are constructed by plotting sensitivity (Y axis) vs. 1-specificity (X axis). Using the sensitivity and specificity curves superimposed on the same graph, you can determine a cutoff point, commonly called the "threshold" (the value at which both curves intersect), i.e. the value of the predictor (prognostic variable) for which sensitivity and specificity are equal.
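The search for this threshold can be illustrated with the following Python sketch (hypothetical data and function name, not InfoStat code): sensitivity and specificity are computed for every observed value used as cutoff, and the threshold is the cutoff where the two curves meet.

```python
import numpy as np

def sens_spec(condition, marker, cutoff):
    """Sensitivity and specificity (%) when marker values >= cutoff are called positive."""
    condition = np.asarray(condition, dtype=bool)
    positive = np.asarray(marker) >= cutoff
    tp = np.sum(positive & condition)
    fn = np.sum(~positive & condition)
    tn = np.sum(~positive & ~condition)
    fp = np.sum(positive & ~condition)
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)

marker = np.array([200, 350, 400, 520, 600, 750, 900, 1200])   # hypothetical prognostic variable
condition = np.array([0, 0, 0, 1, 0, 1, 1, 1])                  # true condition (1: positive)
for cutoff in np.unique(marker):
    print(cutoff, sens_spec(condition, marker, cutoff))         # the crossing point is the threshold
```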
Example 32: The following example studies patients with tumors, all of them bone marrow transplanted. These patients showed different blood parameters, including the lymphocyte count 15 days after transplanting ("Linpho15"). Survival was also recorded ("Survival (months)"). The patient's condition was obtained using the median survival (0: less than the median survival and 1: greater than the median survival). The objective was to determine the "threshold" value of lymphocytes (the value at which sensitivity equals specificity). Data courtesy of Dr. Veronica Ortiz Corbella, Allende Hospital, Córdoba; they are in the Sensitivity file.

The menu STATISTICS CATEGORICAL DATA SENSITIVITY-SPECIFICITY CURVES was invoked; in the Sensitivity-specificity curves window, "Survival (months)" was designated as Response and "Linpho15" as the Test-variable (if more than one prognostic variable is specified, the calculations are performed for each of them). In the next window, in the space provided to specify the condition, the option Values greater than or equal to the median was checked. In the space reserved to specify the prediction categories, it was indicated that patients with lymphocyte values greater than or equal to each of the observed values should be considered positives. Sensitivity and specificity curves were requested simultaneously. The graph shown below indicates that the point of intersection of both curves, the "threshold", is approximately 550 cells/μL.
[Figure: sensitivity and specificity curves (Percentage on the Y axis, Lympho15 on the X axis); the curves intersect at approximately 550.]

Figure 22: Sensitivity-specificity curves to determine the cutoff point for the prognostic variable. File Sensitivity.


Multivariate Analysis
Multivariate analysis is used to describe and analyze multidimensional observations, i.e. data collected on several variables for each of the units or cases under study. The Multivariate Analysis module provides the user with a set of analytical techniques appropriate for data tables that contain two or more response variables (table columns) for each case (table rows). The organization of data for multivariate analysis thus takes the form of a matrix with n rows (cases) containing p features (variables) recorded on the same individual (the data on p variables observed in each of n cases are collected in a matrix X of dimension n×p).
Table 51: Multivariate data organization.

Cases    V1      V2      ...    Vj      ...    Vp
1        X11     X12     ...    X1j     ...    X1p
2        X21     X22     ...    X2j     ...    X2p
...      ...     ...     ...    ...     ...    ...
n        Xn1     Xn2     ...    Xnj     ...    Xnp

Each observation is represented by a p-dimensional vector of random variables and can be conceptualized as a point in R^p, with coordinates equal to the values assumed by each of the variables. If there are 3 individuals and 2 random variables (e.g. height and weight) recorded on each individual, assuming the values shown below, the three bivariate observations can be represented in the space of the two variables as follows:

[Table of the three cases with their Weight and Height values, and the corresponding scatter plot of the bivariate observations (weight vs. height).]

Figure 23: Multivariate data representation.

When more than three variables are surveyed in each case, a direct visualization of the observations is not possible; therefore, dimension reduction techniques have to be used, which project the cloud of points representing the observations onto a plane that is easy to visualize. Commonly used graphics to display and compare multivariate observations are star charts, matrices of scatter plots and graphs of multivariate profiles (see chapter Graphics).
InfoStat allows applying analytical techniques to understand the relationships between variables measured simultaneously, and to compare, group and/or classify observations based on several variables, or variables based on observations. When you select Menu STATISTICS MULTIVARIATE ANALYSIS, the corresponding submenu is displayed. When deciding which multivariate analysis to use, a window appears in which to indicate the variables to analyze and to establish, if necessary, a classification criterion.

Multivariate descriptive statistics


Menu STATISTICS MULTIVARIATE ANALYSIS MULTIVARIATE DESCRIPTIVE STATISTICS. In the Multivariate descriptive statistics selector window you have to specify the response variables in Variables. The Class variables field is optional. If there are one or more columns in the data table that sort the observations into groups, they can be identified as classification criteria to reduce the size of the matrix of observations; in this case InfoStat will use each of the groups formed by the classification criteria as the unit of analysis.
For the calculation of descriptive statistics, the number of observations used by InfoStat is the number of active cases. If you want to remove from the analysis the cases with at least one variable with missing data, you must check the box Delete Incomplete Records at the foot of the Multivariate descriptive statistics window (default option). InfoStat automatically provides the mean vector(s) by group, the unbiased covariance matrix and the correlation matrix. Other statistics that can be selected, activating the appropriate box in the Options window, are presented below:

Theoretical notions of multivariate descriptive statistics


The description of multivariate random samples can be done by calculating sample statistics. Johnson and Wichern (1998) define a random sample of multivariate observations as one where: 1) measurements taken on different cases are not correlated, and 2) the joint distribution of the p variables is the same for each case. Below are some commonly used statistics for the description of multivariate random samples.
Mean vectors (total and per group): the sample mean is calculated for each variable of the data matrix. If the data matrix is of dimension n×p, there will be p sample means, denoted by $\bar{x}_j$ with j=1,...,p, each obtained as:

$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, for j=1,...,p

where $x_{ij}$ is the i-th value of the j-th variable. The p means form, in this case, the total mean vector. If a classification criterion was indicated, the mean vector by group can be obtained from the averages of the p variables calculated with the observations of each group.
Variance-covariance matrices: the sample variance, calculated from the n measurements on each variable, will be denoted by $S_j^2$. For a data matrix of dimension n×p there will be p sample variances, each obtained from the expression:

$S_j^2 = c\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2$, for j=1,...,p

where the constant c can be 1/n or 1/(n-1), corresponding to the maximum likelihood estimator or the unbiased estimator of the population variance, respectively.

The sample covariance measures the linear association between two variables, i.e. it measures how two variables vary together. The covariance between the j-th variable and the k-th variable is obtained by the following expression:

$S_{jk} = c\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)$, for j,k=1,...,p

When a variance-covariance matrix is requested, InfoStat arranges the variances and covariances of the p variables in a symmetric square matrix of dimension p×p. This matrix contains the variances of each of the p variables on the main diagonal and the covariances between each pair of variables as the elements outside the main diagonal. Then, if the matrix is denoted by S, it will be as follows:

$S = \begin{pmatrix} S_1^2 & S_{12} & \cdots & S_{1p}\\ S_{21} & S_2^2 & \cdots & S_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ S_{p1} & S_{p2} & \cdots & S_p^2 \end{pmatrix}$

The variance-covariance matrix has p variances and p(p-1)/2 covariances. In the maximum likelihood variance-covariance matrix, the constant c used in the calculation of each element is 1/n; in the unbiased variance-covariance matrix, the constant c is 1/(n-1). In the Total covariance matrix (in either of its versions), all n observations are used in the calculation of the variances and covariances.
In the Common covariance matrix (in either of its two versions), the resulting matrix is obtained as the weighted average of the variance-covariance matrices of each group. Then, if there is a classification variable that separates the data matrix into two or more groups, the common covariance matrix is obtained through the weighted average of the covariance matrices estimated in each group separately. For example, if there are two groups, the common covariance matrix is:

$S_{common} = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2}$

where $S_1$ and $S_2$ are the covariance matrices of group 1 and group 2, estimated with $(n_1-1)$ and $(n_2-1)$ degrees of freedom, respectively. This common covariance matrix makes sense when there is no difference between the covariance matrices of the groups.
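The weighted average above can be written in a few lines; the sketch below (plain Python/NumPy with a made-up function name, not InfoStat's routine) pools the unbiased covariance matrices of any number of groups in the same way as the two-group formula.

```python
import numpy as np

def common_covariance(groups):
    """Pooled covariance matrix: weighted average of the unbiased within-group
    covariance matrices, with weights (n_i - 1)."""
    numerator, dof = None, 0
    for x in groups:                              # each x is an n_i x p data matrix
        x = np.asarray(x, dtype=float)
        s = np.cov(x, rowvar=False)               # unbiased covariance of this group
        w = x.shape[0] - 1
        numerator = w * s if numerator is None else numerator + w * s
        dof += w
    return numerator / dof

rng = np.random.default_rng(0)
g1 = rng.normal(size=(10, 2))                     # two hypothetical groups,
g2 = rng.normal(size=(15, 2)) + 3.0               # same two variables
print(common_covariance([g1, g2]))
```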
Total correlation matrix: a p×p symmetric square matrix containing the value 1 on the main diagonal and the Pearson correlation coefficients between each pair of variables as the elements outside the main diagonal. The Pearson product-moment correlation coefficient is a measure of the magnitude of the linear association between two variables that does not depend on the measurement units of the original variables. For the j-th and k-th variables it is defined as:

$r_{jk} = \frac{S_{jk}}{\sqrt{S_j^2 S_k^2}} = \frac{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)}{\sqrt{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2 \sum_{i=1}^{n}(x_{ik}-\bar{x}_k)^2}}$

The coefficient has the same value whether $S_{jk}$, $S_j^2$ and $S_k^2$ are computed with divisor n or (n-1). The sample correlation coefficient represents the covariance of the standardized sample values. It assumes values in the range [-1, 1] and its sign indicates the direction of the association (negative values occur when the average trend indicates that when one value of the observed pair is larger than its mean, the other value is smaller than its mean).
Common correlation matrix: a p×p symmetric square matrix containing the value one on the main diagonal and the Pearson correlation coefficients between each pair of variables as the elements outside the main diagonal. Unlike the total correlation matrix, the coefficients are calculated after adjusting for the effect of group (a group defined from a classification criterion). This matrix corresponds to the matrix of partial correlations of each pair of variables, adjusted for the effect of group. For example, if a data matrix on the height and weight of individuals contains observations from two groups and this classification is not taken into account, you might obtain a negative correlation when in fact the correlation is positive within both groups (see Figure 24); in that case we must consider that there are two groups with different means and positive correlations between the two variables within each group. This type of correlation is known as partial correlation, and it is reported by InfoStat when you ask for the common correlation matrix.

Figure 24: Scatter plots of Weight versus Height for two groups (yellow and blue).
Matrix of sums of squares and cross products: a symmetric p×p matrix constructed in the same way as S but with c=1. The sums of squares on the main diagonal and the cross-product sums outside it are themselves important sample statistics. This matrix can be constructed from the n observations (total matrix) or as a weighted average of the matrices of each group (common matrix), as explained above. Algebraically, its (j,k)-th element is:

$W_{jk} = (n-1)S_{jk} = \sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)$

so that W = (n-1)S when S is the unbiased variance-covariance matrix.
Singular value decomposition of the data matrix: a way of describing a data matrix. InfoStat allows applying this decomposition to the active data table or to the covariance, correlation and/or sums of squares matrices. The function applied to a matrix yields the set of singular values and vectors that describe it. A rectangular matrix X of dimension n×p can be written in terms of its singular value decomposition as $X = UDV'$, where U is n×p with orthogonal columns, V is an orthogonal p×p matrix and D is a p×p diagonal matrix whose elements are called the singular values of X.
The spectral decomposition of a matrix can be seen as a particular case of this decomposition, which applies to square symmetric matrices; in this case U and V are equal. InfoStat can apply this type of decomposition to any of the matrices listed in the multivariate descriptive statistics menu. The Intermediate reconstruction steps option provides, up to the order r specified by the user, the reconstruction of the matrix from r pairs of singular values and vectors (eigenvalues and eigenvectors in the symmetric case), so the user can view the reconstruction step by step. When r is equal to the number of nonzero singular values, the last step of the reconstruction returns the original matrix.
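This step-by-step reconstruction can be mimicked with NumPy, as in the following sketch (illustrative only, not InfoStat's routine): the matrix is rebuilt from the first r singular values and vectors, and with r equal to the rank the original matrix is recovered.

```python
import numpy as np

def svd_reconstruction(x, r):
    """Rank-r reconstruction of X from its singular value decomposition X = U D V'."""
    u, d, vt = np.linalg.svd(np.asarray(x, dtype=float), full_matrices=False)
    return u[:, :r] @ np.diag(d[:r]) @ vt[:r, :]

x = np.array([[2.0, 0.5, 1.0],
              [0.5, 1.5, 0.3],
              [1.0, 0.3, 2.5]])                       # a small symmetric (covariance-like) matrix
for r in (1, 2, 3):
    print(r, np.round(svd_reconstruction(x, r), 3))   # r = 3 returns the original matrix
```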
Eigenvectors and eigenvalues of S: both summarize the information in terms of variability. The eigenvectors constitute a set of basis vectors in which to represent the data, and the eigenvalues represent the variability of the data in each of the directions given by the eigenvectors. Thus the eigenvalues are measures of variability, while the eigenvectors express the directions of that variability.
Determinant: the determinant of S, also known as the generalized variance, is a way of summarizing multivariate variability; it condenses the information on all the variances and covariances between variables into a single number. Calculated on S, it can also be seen as the product of its eigenvalues.
Many different covariance matrices can have the same value of generalized variance. The generalized variance is zero when at least one deviation vector is a linear combination of the others. An important result states that if p>n (more variables than observations) the determinant of the covariance matrix will be zero. The determinant of the correlation matrix is the generalized variance of the standardized variables.
Trace: the sum of the diagonal elements of a matrix. Calculated on S, it is another one-dimensional measure of multivariate variability. The total sample variance is the sum of the variances of each variable, i.e. trace(S); it does not take into account the correlation structure. On S it can also be seen as the sum of its eigenvalues.
Square root: returns the matrix R such that R*R is equal to the selected matrix. R is obtained using the singular value decomposition algorithm.
G-inverse: returns a generalized inverse of the previously selected matrices. The diagonal elements of the inverse correlation matrix give an idea of the extent to which each variable is a linear function of the others. These diagonal elements are often called variance inflation factors. The j-th diagonal element is $1/(1-R_j^2)$, where $R_j^2$ is the coefficient of determination of the regression of the j-th variable on the other variables. A large diagonal element indicates that the corresponding variable is highly correlated with some of the other variables. When one or more variables are a linear function of the other variables considered, the correlation matrix is not of full rank.
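For a full-rank correlation matrix, the diagonal of its ordinary inverse gives these variance inflation factors directly, as the following NumPy sketch illustrates (hypothetical data; a generalized inverse, as used by InfoStat, would also cover the rank-deficient case).

```python
import numpy as np

def variance_inflation_factors(x):
    """Diagonal of the inverse correlation matrix: the j-th element equals 1 / (1 - R_j^2)."""
    r = np.corrcoef(np.asarray(x, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(r))

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + 0.1 * rng.normal(size=100)      # nearly a linear function of x1
print(np.round(variance_inflation_factors(np.column_stack([x1, x2, x3])), 2))
# the inflated diagonal elements for x1 and x3 reveal the near collinearity
```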

Cluster analysis
Menu STATISTICS MULTIVARIATE ANALYSIS CLUSTER ANALYSIS allows implementing different procedures for grouping objects described by the values of a set of variables. Objects usually correspond to the rows of the data table. Occasionally, these procedures are used to group variables rather than observations (i.e. to cluster columns instead of rows). In InfoStat, the Cluster Analysis window lets you select the file variables to be used in the analysis and indicate one or more variables as classification criteria in order to summarize many records into a single case. Pressing the OK button, another window called Cluster analysis appears, which has three tabs: Hierarchical, Non hierarchical and Summary statistics. If a criterion for sorting records has been defined, in the Summary statistics tab InfoStat allows choosing between position measures such as the mean, median, minimum and maximum, and dispersion measures such as the variance and standard deviation, to summarize the information of each variable in each set of records defined by the classification criterion (by default the mean is used).
In the Hierarchical and Non hierarchical tabs, you can choose the method (by default Average linkage is selected for hierarchical clustering and K-means as the non-hierarchical algorithm) and the type of distance (by default the average Euclidean distance) used in the formation of clusters. InfoStat allows activating the option to standardize the data; this option automatically standardizes each column selected as a variable before clustering. The analysis can be performed by rows (grouping records) or by columns (grouping variables). When the number of objects to be classified is large, the dendrogram can be divided into pages (Paging dendrogram option).
For both non-hierarchical and hierarchical clustering, whether grouping cases (cluster rows) or variables (cluster columns), activating the Save classification box makes InfoStat generate a new column in the active data table containing the group number assigned to each observation. The number of groups should be specified in advance in the Max number of clusters box. InfoStat automatically provides a graph showing the reduction in the clustering objective function in relation to the number of clusters (from two to the number indicated by the user), identifying the groups formed with different colors. In the case of non-hierarchical clustering, the recommended number of groups is the one associated with a marked drop in the objective function with respect to the previous number.
For hierarchical clustering, InfoStat automatically produces the dendrogram corresponding to the evolution of the clustering based on the selected distance. The information displayed in the dendrogram can be viewed in the Output window.

Example 5: A data collection plan was made to analyze morphometric similarities and differences among 14 genotypes (cultivars) of chickpea. Nine variables were measured, such as length, width and thickness of the pod, among others, with several observations per genotype (the object to be grouped). The data (courtesy of Julia Carreras, Faculty of Agricultural Sciences, UNC) are found in the Chickpea file.
To perform a hierarchical cluster analysis, STATISTICS MULTIVARIATE ANALYSIS CLUSTER ANALYSIS was chosen. In the Cluster Analysis window, all the measurements were specified as Variables and "genotype" as the Class variable. In the Hierarchical tab, the Average linkage method and the Euclidean distance were chosen. The Standardize data field was selected and an analysis to cluster Rows was requested. The number of clusters was set to 4 and, to summarize the observations of the same genotype, the default option of the Summary statistics tab (Mean) was used. The following dendrogram was obtained:

[Dendrogram: Average linkage, Euclidean distance; genotypes 555, 156, 202, 70, 337, 240, 521, 517, 522, 336, 67, 507, 75 and 41; distance axis from 0.00 to 7.00.]

Figure 25: Dendrogram. File Chickpea.IDB2.


In this example, setting an arbitrary cut-off at a distance of 3.5, genotype 555 separates from the rest; genotypes 156, 202 and 70 form one group, genotypes 337 and 240 another group, and the remaining genotypes form a third group. A frequently used reference criterion is to draw the cut line at a distance equal to 50% of the maximum distance (here the maximum distance is around 7, so the line was drawn at 3.5).

Theoretical notions about cluster analysis

Multivariate clustering of objects is often used as an exploratory technique, in order to gain more knowledge about the structure of the observations and/or variables under study. While it is true that the clustering process initially involves a loss of information, because units that are not identical (only similar) are placed in the same class, the synthesis of the available information on the units concerned can significantly facilitate the visualization of complex multivariate relationships. Clustering techniques are used when no cluster structure of the data is known "a priori" and the operational objective is to identify the natural grouping of the observations. Classification techniques based on groupings involve the distribution of the units of study into classes or categories so that each class (cluster) puts together units whose similarity is maximum under some criterion, i.e. objects in the same group share the largest permissible number of features and objects in different groups tend to be different.
Grouping objects (cases or variables) requires some algorithm. The word algorithm describes a systematic set of operating rules that allow a type of task to be carried out step by step to obtain a result. Clustering algorithms or methods allow identifying the classes that exist in relation to a given set of attributes or characteristics. In different areas of knowledge these algorithms appear under different names, such as automatic classification, typological analysis (from the French "analyse typologique"), cluster analysis, numerical taxonomy, etc. Classification algorithms can be divided into non-hierarchical and hierarchical. Non-hierarchical classification techniques seek a single decomposition or partition of the original set of objects based on the optimization of an objective function, whereas hierarchical classification techniques find nested, consecutively finer (or coarser) partitions, joining (or splitting) objects into groups step by step. In biology (taxonomy), hierarchical techniques are traditional because they best reflect the complexity of the organization of living things and the existence of different evolutionary levels. In other applications, non-hierarchical techniques can provide an appropriate description of the data, such as the classification of books in a library.
Clustering algorithms can be supervised or unsupervised depending on whether the number of classes to be obtained is set "a priori" by the person conducting the experiment or whether it results from the application of the classification technique. Often, available preliminary information or the results of pilot experiments can guide the experimenter or user in selecting the number of classes. Other times, a maximum value for the number of classes is known; the algorithm is then run specifying that value and, in view of the results obtained, the clusters are re-made. Hierarchical classification techniques are generally of the unsupervised type.
The clusters achieved depend not only on the chosen clustering algorithm but also on the selected distance measure, the number of groups to be formed (if this information exists), the selection of the variables for the analysis and their scaling. Traditional texts that address the problems associated with the formation of clusters are those of Anderberg (1973) and Everitt (1974).
Cluster analysis of cases or individual records starts from an n×p matrix (say p measurements or variables on each of the n objects studied), which is then transformed into a distance matrix (n×n) whose (i,j)-th element measures the distance between the pair of objects i and j, for i,j=1,...,n. The matrix elements are metric or non-metric distance functions. In the cluster analysis of variables, a p×p distance matrix is used, whose (i,j)-th element measures the distance between the pair of variables i and j, for i,j=1,...,p.
When there are many variables for clustering, dimension reduction techniques such as Principal Components Analysis are commonly applied (before the cluster analysis) to obtain a smaller number of variables capable of expressing the variability in the data; this may facilitate the interpretation of the clusters obtained.
In practice, we recommend applying several clustering algorithms and selections or combinations of variables to each data set, finally selecting, from the groupings obtained, the one with the most appropriate interpretation. InfoStat automatically provides the value of the cophenetic correlation coefficient, which can be used to select one of several alternative groupings. This coefficient correlates the distances induced by the binary tree metric with the original distances between objects; the grouping with the higher cophenetic coefficient is then expected to be the one that best describes the natural grouping of the data.
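The use of the cophenetic correlation to compare alternative groupings can be sketched with SciPy (an illustration with simulated data, not the InfoStat computation): several linkage methods are applied to the same distances and the method with the highest coefficient is the one that best preserves the original distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 4))                   # 20 hypothetical cases, 4 variables
d = pdist(x, metric='euclidean')               # original pairwise distances
for method in ('single', 'complete', 'average', 'ward'):
    tree = linkage(d, method=method)
    coef, _ = cophenet(tree, d)                # correlation between tree distances and d
    print(method, round(coef, 3))
```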


It is important to remark that clustering procedures produce successful results when the data matrix has a structure that can be interpreted in terms of the problem that gave rise to the collection of the information. It is therefore important to characterize the groups achieved through various summary measures, to ease the interpretation of the final grouping.

Hierarchical clustering methods


Hierarchical methods produce clusterings in which a cluster can be contained entirely within another, but no other type of overlap between clusters is permitted. The hierarchical algorithms used for clustering can be agglomerative or divisive (using successive mergers or divisions to group objects).
Agglomerative methods perform the grouping by successive unions. At the beginning there are as many groups as objects. The most similar objects are grouped first, and these initial groups are then joined according to their similarities. Divisive hierarchical methods begin by assuming that all objects belong to a single group, which is partitioned into finer and finer subdivisions, to the point where each object is considered a cluster of unit size. InfoStat works with agglomerative methods as these are more satisfactory with respect to computation time.
Hierarchical clustering results are shown in a dendrogram (a tree diagram in two dimensions) in which you can see the joins and/or divisions made at each level of the cluster-building process (see Figure 25). The branches of the tree represent the clusters. Branches meet at a node whose position along the distance axis indicates the level at which the fusion occurs. The node where all the entities form a single cluster is called the root node. Because at each level the union of two observations (or two clusters) is evaluated, these dendrograms are known as binary trees. In practice, the focus tends to be placed on intermediate outcomes where the objects are classified into a moderate number of clusters.
One of the main characteristics of the agglomerative hierarchical clustering procedure is that the location of an object in a group does not change; that is, once an object is placed in a cluster, it is not relocated. Its cluster can, however, be merged with another cluster to form a third one that includes both.
Agglomerative algorithms proceed as follows: initially, each object belongs to a different cluster; in the next step the two closest objects are merged to form the first cluster; in the third stage, either a new object is added to the cluster formed in the first stage or two other objects merge to form a second cluster. The process continues in a similar way until, eventually, a single cluster containing all objects as members is formed. Hierarchical clustering techniques differ in the rule they use to assign objects to a cluster or to fuse clusters. Listed below are the hierarchical clustering algorithms available in InfoStat:
Single linkage (Florek et al. 1951a, 1951b): groups are joined based on the distance between their two closest members. This method, also known as nearest neighbor, uses the concept of minimum distance and starts by looking for the two objects that minimize it; they constitute the first cluster. The following steps proceed as explained above, but starting from n-1 objects, one of which is the cluster formed earlier. The distance between clusters is defined as the distance between their closest members.

Example: we apply the single linkage technique starting from the following matrix of distances between five individuals:

          A    B    C    D    E
      A   0
      B   1    0
D1 =  C   5    3    0
      D   6    8    4    0
      E   8    7   11    2    0

where the entry in the i-th row and j-th column gives the distance d(ij) between individuals i and j.
In the first stage A and B are joined, since d(AB)=1.0 is the smallest element of D1. Once this cluster is formed, the distances between the cluster and the remaining individuals C, D and E are calculated as follows:
d(AB, C) = min {d(AC), d(BC)} = d(BC) = 3.0
d(AB, D) = min {d(AD), d(BD)} = d(AD) = 6.0
d(AB, E) = min {d(AE), d(BE)} = d(BE) = 7.0
The distance matrix D2 is then formed, whose elements are the distances between individuals and the distances between the individuals and the group (AB):

            (AB)   C    D    E
      (AB)   0
D2 =   C     3     0
       D     6     4    0
       E     7    11    2    0

The smallest entry of D2 is d(DE)=2.0. Therefore the decision is to join individuals D and E, forming a new group, without adding anything to the cluster already formed.
In the third stage, the following distances are calculated:
d(AB, C) = 3.0
d(AB, DE) = min {d(AD), d(AE), d(BD), d(BE)} = d(AD) = 6.0
d(DE, C) = min {d(CD), d(CE)} = d(CD) = 4.0
Then, D3 results:

            (AB)   C   (DE)
      (AB)   0
D3 =   C     3     0
      (DE)   6     4    0

The smallest element of D3 is d(AB, C). This indicates that individual C should be grouped into the first cluster (with A and B). In the last step, the two groups fuse into a single cluster, which contains all the individuals.
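The same five-individual example can be reproduced with SciPy's hierarchical clustering routines, as in the following sketch (illustrative only; the merge heights 1, 2, 3 and 4 match the steps worked out above).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix D1 between individuals A, B, C, D, E
d1 = np.array([[ 0,  1,  5,  6,  8],
               [ 1,  0,  3,  8,  7],
               [ 5,  3,  0,  4, 11],
               [ 6,  8,  4,  0,  2],
               [ 8,  7, 11,  2,  0]], dtype=float)

z = linkage(squareform(d1), method='single')   # single linkage on the condensed distances
print(z)  # each row: clusters merged, merge distance, size of the new cluster
# merges: (A,B) at 1, (D,E) at 2, C joins (A,B) at 3, the two clusters join at 4
```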
Since the single linkage procedure joins clusters based on the minimum distance between their elements, it may have problems when there are groups that are very close or overlapping. Single linkage is one of the few procedures that perform well with non-elliptical cluster configurations (chained data). This method is recommended for detecting irregular and elongated cluster structures. It tends to separate the extremes of the distribution before separating the main groups (Hartigan, 1981), and therefore tends to produce chain-like clusters.
Complete linkage (Sorensen, 1948): the distance between clusters is that of their most distant pair of objects. This method, also known as furthest neighbor, is similar to the previous one, but the distances are now defined as the distance between the most distant pair of individuals. To illustrate the complete linkage procedure we work with the distance matrix D1 used to develop the single linkage technique. Individuals A and B are merged first, as in single linkage; however, the distances between this cluster and the three remaining individuals of D1 are obtained as follows:
d(AB, C) = max {d(AC), d(BC)} = d(AC) = 5.0
d(AB, D) = max {d(AD), d(BD)} = d(BD) = 8.0
d(AB, E) = max {d(AE), d(BE)} = d(AE) = 8.0
This gives:

            (AB)   C    D    E
      (AB)   0
D2 =   C     5     0
       D     8     4    0
       E     8    11    2    0

Therefore, individuals D and E are joined in a new cluster. By similar calculations we obtain:

            (AB)   C   (DE)
      (AB)   0
D3 =   C     5     0
      (DE)   8    11    0

Then, C joins the cluster (AB) and finally both clusters are joined.
It can be shown that this method is identical to the method known as the "minimal tree" (Anderberg, 1973). The algorithm for calculating the distance between clusters corresponds to dividing the tree formed by joining, in the first instance, the two objects with the smallest separation, then the pair with the next smallest separation, and so on, finally joining those with the greatest separation. It tends to produce groups of equal diameter and is very resistant to outliers (Milligan, 1980).
Average linkage or UPGMA (unweighted pair-group method using an arithmetic average) (Sokal and Michener, 1958): in this method the distance between two clusters is obtained by averaging all the distances between pairs of objects in which one member of the pair belongs to one cluster and the other member to the second cluster. This is one of the simplest methods and it has proved successful in many applications. Various expressions have been proposed to calculate the average distance; one of them is:

$d_{(AB)C} = \frac{\sum_i \sum_j d_{ij}}{n_{(AB)}\, n_C}$

where $d_{ij}$ is the distance between object i, which belongs to cluster (AB), and object j, belonging to cluster C; the sum is over all possible pairs of objects between the two clusters, and $n_{(AB)}$ and $n_C$ are the numbers of objects in clusters (AB) and C respectively. The method tends to produce groups of equal variance (Milligan, 1980).
Weighted average linkage, WPGMA: also known as McQuitty's method, it was introduced independently by Sokal and Michener (1958) and McQuitty (1966). It represents a generalization of the previous procedure, using the number of objects in each cluster as a weight, so the distance is based on a weighted average. If the weights are equal this method gives the same results as the previous one.
Unweighted centroid, UPGMC (Sokal and Michener, 1958): the average of all objects in a cluster (centroid) is taken to represent the cluster, and distances between objects or clusters are measured with respect to the centroids. It is the agglomerative procedure most robust to outliers (Milligan, 1980).
Weighted centroid: a generalization of the previous procedure, weighting the distances by the number of objects in each cluster involved in the calculation. The centroid-based methods assume a distance matrix based on the Euclidean metric. If the weights are equal this method gives the same results as the previous one.
Ward or minimum variance method (Ward, 1963): similar to the centroid method, but when linking clusters it weights all the participating clusters (by the size of each group) and, at each join, the loss of information is minimized. It defines the distance between two groups as the ANOVA sum of squares between the two groups added over all the variables. The method is recommended for data with normal distribution and spherical covariance matrices that are homogeneous between groups. It tends to produce groups with equal numbers of observations and may be severely affected by outliers (Milligan, 1980).
The hierarchical procedures described above take no special action with outliers: if an unusual observation is classified into some group in the early stages of the procedure, it will remain there in the final configuration. It is therefore important to carefully review the final configuration.
The practice of applying more than one procedure and more than one distance measure usually helps to differentiate between natural and artificial groupings. Some experimenters use a perturbation technique (introduction of errors in the data and re-clustering under the new situation) to test the stability of the hierarchical classification. The resampling technique known as bootstrap is also recommended to test the stability of the nodes reached in a particular grouping.

Non-hierarchical clustering methods


InfoStat allows grouping objects by the non-hierarchical K-means procedure. The algorithm groups the objects into k groups, maximizing the variation between clusters and minimizing the variation within each cluster. The method begins with an initial clustering or a set of seed points (centroids) that form the centers of the groups (an initial partition of the objects into k groups). It proceeds by assigning each object to the group whose centroid (mean) is closest. The distance commonly used is the Euclidean distance, on either standardized or non-standardized observations. It works by minimizing the objective function "sum of squared distances": the partition achieved is such that the sum, over groups, of the squared distances of the group members to their centroid is minimal. The method is based on the principle of the best k centroids, and the centroids are modified each time an object is transferred from one group to another. The K-means algorithm is optimal at each step; the final results may depend on the initial configuration, on the sequence in which the objects are examined and on the number of groups. In order to reach a global optimum, it is advisable to use several initial partitions and select the final partition with the minimum objective function value. InfoStat automatically reports the values of this function under the name SWSS (Sum of Within Squares Sums). The successive application of non-hierarchical and hierarchical procedures is a recommended strategy to determine the number of groups appropriate for the problem at hand: it is advisable to apply first an agglomerative hierarchical method that suggests a number of groups (establishing a threshold criterion such as 55% of the maximum distance) and then use this information as the initial partition of the K-means algorithm.
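A bare-bones version of the algorithm is sketched below in Python (an illustration of the idea, not InfoStat's implementation): objects are assigned to the nearest centroid, centroids are recomputed, and the within-group sum of squared distances (the quantity InfoStat reports as SWSS) is returned.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Plain K-means with random initial centroids taken from the data.
    For simplicity, the sketch does not handle clusters that become empty."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                       # nearest centroid for each object
        new = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    swss = sum(((x[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, swss

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centers, swss = kmeans(data, k=2)
print(labels, round(float(swss), 2))
```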

Distances
Cluster analysis requires measuring the similarity between the entities to be grouped. InfoStat works with dissimilarity or distance measures. The selection of an appropriate distance measure depends on the nature of the variables (qualitative, quantitative), on the measurement scale (nominal, ordinal, interval, ratio) and on the knowledge of the object of study. All the distance functions discussed in this document may be used with any clustering procedure.
For data with metric properties (continuous, interval and/or ratio scales), measures such as the Manhattan or Euclidean distance can be used, while for qualitative or discrete attributes distance measures based on similarity or association are more appropriate. Different functions can be used in InfoStat to obtain distances from similarity measures: when a similarity measure is chosen in the cluster analysis window, another window automatically appears to its right in which that function can be chosen.
For the grouping of variables, distance measures based on correlation coefficients are recommended. All the distance measures that can be used in this module are described in the module of menu STATISTICS MULTIVARIATE ANALYSIS DISTANCES ASSOCIATIONS.
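The following SciPy sketch (hypothetical data, not an InfoStat routine) illustrates the effect of the distance choice mentioned above: Euclidean and Manhattan distances when the cases are the objects, and a correlation-based distance when the columns (variables) are the objects to be grouped.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [4.0, 0.0, 3.0]])                            # three cases, three metric variables

print(squareform(pdist(x, metric='euclidean')))            # Euclidean distances between cases
print(squareform(pdist(x, metric='cityblock')))            # Manhattan distances between cases
print(squareform(pdist(x.T, metric='correlation')))        # 1 - correlation, to group variables
```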

Principal components
Menu STATISTICS MULTIVARIATE ANALYSIS PRINCIPAL COMPONENTS allows analyzing the interdependence of metric variables and finding the best graphical representation of the variability of the data in a table of n observations and p columns or variables. Principal component analysis (PCA) consists of finding, with minimal loss of information, a new set of uncorrelated variables (components) that explain the structure of variation in the rows of the data table.

In the Principal component analysis window, the response variables should be indicated, as well as a classification variable if there is one. If a classification criterion is indicated, InfoStat works with the a×p data matrix, where a is the number of classification levels and p the number of selected variables. The General tab contains options to keep the components obtained (Save axes), either as the number of components indicated or with an automatic criterion for selecting the number of axes to keep. When the user activates # automatic, InfoStat will save as many axes as there are eigenvalues greater than the average eigenvalue. If the principal components are saved, new columns will be added to the active table. These components can subsequently be used for scatter plots of the observations (a scatter plot using PC1 and PC2 as axes displays the highest variability among observations). If several PCAs are run, as many new columns will be generated as components are saved in each analysis. To avoid this accumulation of new columns you can enable Overwrite, so that only the axes of the last PCA are kept. You can also ask for the standardization of each variable before the analysis (Standardize data), the display of the covariance or correlation matrix on which the analysis is performed (Show covariance/correlation matrix), the correlations with the original variables, the cophenetic correlation coefficient, Biplot graphics and the minimum spanning tree (MST). If a classification criterion has been indicated, the Summary measures tab allows choosing between position measures such as the mean, median, minimum and maximum, and dispersion measures such as the variance and standard deviation, as the statistic used to summarize each variable in each set of records indexed by the criterion (optional).
Example 6: In a study whose objective was to examine the foods used as protein sources in the diets of people in European countries, the foods consumed were recorded. Data are found in the Protein file.
Menu STATISTICS MULTIVARIATE ANALYSIS PRINCIPAL COMPONENTS. In the Principal Components Analysis window, "Beef", "Pork" and the other variables that are sources of protein were selected as Variables and "Country" as the Classification variable. In the principal component analysis window, Save axes was activated and the number 2 was entered to preserve the first two components. Standardize data was also activated so that the analysis uses the correlation matrix instead of the covariance matrix of the variables. The Biplot option was activated, from which the following graph was obtained:

[Biplot of the first two principal components (PC 1 vs. PC 2) showing the European countries and the protein-source variables.]

Figure 26: Biplot. File: Proteins.idb2.


As can be seen, the first component (PC1) separates cereal and dry fruits from the other protein sources, so the greatest variability between the consumption habits of the different countries is explained by these variables. Albania, Bulgaria, Yugoslavia and Romania are more associated with the consumption of cereal, Greece and Italy with dry fruits, and Spain and Portugal consume more fruits and vegetables; Norway is associated with fish consumption, while East Germany, France, Denmark, England, Sweden and Belgium are associated with the consumption of beef, and some of these countries also with eggs. Germany, Switzerland and Finland are more associated with the consumption of milk and eggs. The Netherlands, Ireland and Austria are associated with the consumption of pork and milk, while Czechoslovakia is associated with pork. Poland is not associated with any particular protein source. These two axes explain 65% of the total variability in the observations (see Table 52). To explore these relationships further, a third component could have been requested by asking InfoStat to save 3 axes; in that case InfoStat reports all the Biplot graphics that can be built from the three saved components.
InfoStat automatically provides the eigenvalues and eigenvectors resulting from the principal component analysis of the correlation matrix (or covariance matrix, as requested) of the variables, which are displayed in the Output window. For the example presented, these values can be seen in Table 52. In this table you can see the eigenvalues associated with each eigenvector (as many eigenvectors are shown as principal components were selected for the analysis), the proportion of the total variability explained by each component (eigenvalues) and the cumulative proportion of the total variability explained, in the table named Eigenvalues.

From the results of this example, note that the first two components explain 65% of the total variation. The eigenvectors (e1 and e2) show the coefficients with which each original variable was weighted to form PC1 and PC2. In this example, you can see that, in the construction of PC1, the variables "Cereal" and "Dry_fruits" receive the highest negative loadings and the variable "Eggs" the highest positive loading; the variables "Milk", "Beef" and "Pork" also have relatively high positive coefficients. PC1 can then be interpreted as opposing the countries that use grains and dry fruits as main sources of protein to those that primarily use eggs and other animal products as protein sources. The remaining retained eigenvectors can be read in the same way to explain the meaning of each component.
In this example, after explaining the variability in the feeding habits of the countries due to the intake of cereals and dry fruits versus animal products of the type mentioned, the variability introduced by the consumption or not of fish and of fruits and vegetables (PC2) should be emphasized. The orthogonality of the principal components ensures that PC2 provides new information on variability compared with that provided by PC1, i.e. it explains variability in dietary habits between countries that is not explained by PC1.
Table 52: Principal components analysis: eigenvalues and eigenvectors. File Proteins.
Principal component analysis

Eigenvalues
Lambda   Value   Proportion   Cum. Prop.
1        3.72    0.47         0.47
2        1.50    0.19         0.65
3        1.09    0.14         0.79
4        0.85    0.11         0.90
5        0.33    0.04         0.94
6        0.28    0.03         0.97
7        0.12    0.02         0.99
8        0.10    0.01         1.00

Eigenvectors
Variables             e1      e2
Beef                  0.33    0.10
Pork                  0.33   -0.29
Eggs                  0.44    0.02
Milk                  0.41   -0.07
Fish                  0.11    0.71
Cereal               -0.44   -0.32
Dry_fruits           -0.44    0.14
Fruits_vegetables    -0.14    0.53

The principal components generated by InfoStat are linear combinations of the original variables previously standardized when the analysis is applied to the correlation matrix. When the analysis is applied to the covariance matrix, InfoStat reports linear combinations of the variables centered only on their means.
If, in addition to requesting the Biplot graph, the MST option is enabled in the above example, the following graph is obtained:

[Biplot of PC 1 vs. PC 2 with the minimum spanning tree superimposed, linking the countries according to their distances in the original variable space.]

Figure 27: Biplot and minimum spanning tree. File Proteins.


Minimum spanning trees (see MST) link the observation points according to the distances between them calculated in the original space, i.e. the space with as many dimensions as variables involved in the study. The distance between two points in the plane (the reduced space obtained from the original one) may not accurately reflect the true distance structure, i.e. the structure in that larger space. In this example, the MST option allows a better visualization of the associations between countries according to their protein sources. For example, Hungary appears closer to Russia than to Romania in the chart, but Hungary's eating habits are more similar to those of Romania than to those of Russia, since the tree joins Hungary and Romania first.

Theoretical notions about principal component analysis


Principal component analysis and biplot graphs are techniques commonly used for dimension reduction. Dimension reduction techniques allow examining all the data in a space smaller than the original space of the variables. With PCA, artificial axes (principal components) are constructed that allow scatter plots of observations and/or variables with optimal properties for the interpretation of the underlying variability and covariation. Biplots allow visualizing observations and variables in the same space, so it is possible to identify associations between observations, between variables, and between variables and observations.
Differences in the data generate variability, so one way to summarize and organize the data is through the analysis or explanation of the variance and covariance structure of all the study variables. PCA is a technique often used to organize and represent continuous multivariate data through a set of d=1,...,p normalized orthogonal linear combinations of the original variables that explain the variability in the data, in such a way that no other set of linear combinations of the same cardinality has higher variance than the set of principal components. Usually a number d much smaller than p is selected to represent the underlying variability, and it is expected that the reduction of dimensionality produces no significant loss of information. From this point of view, the dimension reduction technique provides consistent help in the interpretation of the data. The first component contains more information (variability) than the second, the second more than the third, and so on until all the variability is explained.
The PCA used to order observations is based on the spectral decomposition of the covariance or correlation matrix between variables, of dimension p×p. The choice between the unbiased and the maximum likelihood estimator of the population covariance matrix is irrelevant, since both produce the same sample principal components. Using the eigenvectors of S or R as the coefficient vectors of the linear combinations, it can be shown that the principal components are uncorrelated linear combinations whose variances are maximum.

The j-th principal component (PCj) is algebraically a linear combination of the p original variables, obtained as $Y_j = e_j'X = e_{1j}X_1 + e_{2j}X_2 + ... + e_{pj}X_p$, with j=1,...,p, where $e_j$ represents the j-th eigenvector. The new variables, the PCs, use the information contained in each of the original variables, although some variables may contribute more than others to a given linear combination. The variance of the j-th principal component is $\lambda_j$, the eigenvalue associated with the j-th eigenvector of S (the eigenvalues are sorted in decreasing order, $\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_p$). It is also satisfied that the covariance between any two components is zero. The proportion of the total variance explained by the first d components is:

$Prop_d = \frac{\lambda_1 + \lambda_2 + ... + \lambda_d}{\lambda_1 + \lambda_2 + ... + \lambda_p}$

The coefficients of each original (standardized) variable in a PC allow identifying the variables that contribute most to explaining the variability among observations along the axis associated with the corresponding PC. To analyze the association between variables and components, the correlations between principal components and original variables can be used. These are given by:

$r(Y_j, X_k) = \frac{e_{kj}\sqrt{\lambda_j}}{\sqrt{S_k^2}}$

and they also represent an indicator of how important a particular variable is in the construction of the component. The interpretation of this correlation may be more reliable than the interpretation of the coefficients that form the eigenvectors, since the correlation takes into account the differences in the variances of the original variables and thus eliminates the bias caused by the different scales.
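The quantities discussed in this section (eigenvalues, eigenvectors, component scores, explained proportions and variable-component correlations) can be obtained with a few lines of NumPy, as in the following sketch of a PCA on the correlation matrix (illustrative only, with simulated data; it is not the InfoStat routine).

```python
import numpy as np

def pca_correlation(x):
    """PCA on the correlation matrix of x (n cases x p variables)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardized variables
    r = np.corrcoef(x, rowvar=False)                   # p x p correlation matrix
    eigval, eigvec = np.linalg.eigh(r)                 # spectral decomposition of R
    order = np.argsort(eigval)[::-1]                   # sort eigenvalues in decreasing order
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = z @ eigvec                                # principal component scores
    prop = eigval / eigval.sum()                       # proportion of variance per component
    corr = eigvec * np.sqrt(eigval)                    # r(Y_j, X_k) for standardized variables
    return eigval, eigvec, scores, prop, corr

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 4))
x[:, 1] += 0.8 * x[:, 0]                               # induce correlation between two variables
eigval, eigvec, scores, prop, corr = pca_correlation(x)
print(np.round(eigval, 2), np.round(np.cumsum(prop), 2))
```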
The data to be analyzed may or may not be pre-centered and/or scaled, resulting in different types of PCA. PCA from the correlation matrix (the covariance matrix of the centered and scaled original variables) is useful when the units of measurement and/or the variances of the variables are different; otherwise, the variables with higher variance (not necessarily more informative) have too much influence in determining the solution. The principal components obtained using the correlation matrix can be substantially different from those obtained using the covariance matrix, and in each case it must be judged which analysis is more convenient. When the variables do not have similar variances or are not measured on the same scale (commensurable variables), we recommend obtaining the components from the correlation matrix.
The cophenetic correlation coefficient reported by InfoStat under PCA calculates the correlation between the Euclidean distances in the reduced space and the same distances in the original space, whose dimension is given by the number of original variables. This value can therefore be used as a measure of the quality of the reduction achieved.

Biplot
The scatter plots constructed from the principal components can be used to visualize the dispersion of the observations, but the influence of the variables is not explicit in such diagrams. Biplot graphics, proposed by Gabriel (1971), show the observations and the variables in the same graph, making it possible to interpret the joint relationships between observations and variables. The prefix "bi" in the name biplot reflects this characteristic: both observations and variables are represented in the same graph.
In the biplots, InfoStat plots the observations as blue dots. The configuration of the points is obtained from a PCA. The variables are plotted as vectors from the origin (with yellow circles at their ends). In the biplots constructed, the distance between the symbols representing observations and the symbols representing variables has no interpretation, but the directions of the symbols from the origin can be interpreted: observations (row points) plotted in the same direction as a variable (column point) may have relatively high values for that variable and low values for the variables or columns plotted in the opposite direction. On the other hand, the angles between the vectors representing the variables can be interpreted in terms of the correlations between variables. A 90° angle between two variables indicates that they are not correlated, and departures from this value (angles smaller or larger than 90°) indicate correlation (positive or negative); that is, an angle close to zero implies that both variables are strongly positively correlated, and an angle close to a straight angle (180°) indicates a strong negative correlation. When the lengths of the vectors are similar, the figure suggests similar contributions of each variable to the representation.
In InfoStat, the Biplot option can be selected to represent the results of the PCA. The graph is obtained by making a scatter plot of the observations from the PCA on the correlation (or covariance) matrix of the variables and superimposing, appropriately scaled in the same space, the eigenvectors that represent the variables.

Minimum spanning tree (MST)

Spanning trees are constructed by joining the points that represent the multivariate observations projected on a plane as a result of some dimension reduction technique. The points are connected with straight line segments such that all points are linked directly or indirectly and there are no loops (Gower and Ross, 1969). The minimum spanning tree is the spanning tree whose segments are connected so that the sum of the lengths of all segments is minimal. Within the PCA, the MST is calculated taking into account the distances between the points (rows of the data matrix) in the original space (whose dimension is equal to the number of variables under study).


Discriminant analysis
Menu STATISTICS MULTIVARIATE ANALYSIS DISCRIMINANT ANALYSIS allows canonical discriminant analysis (DA) on metric variables.
Discriminant analysis is useful for: 1) discriminating a priori defined groups on the basis of selected variables and representing the observations in a space where the differences between groups are maximal, and 2) classifying new cases into the groups established a priori on the basis of the values of those variables.
In the Discriminant analysis window, declare the Variables that make up the discriminant function and the variables that define the grouping criterion (Grouping variables). InfoStat expects more than one observation per group. For the analysis to be conducted, the number of individuals per group should equal or exceed the number of variables.
After OK, another window appears where you can request: the between-groups covariance matrix, the between-groups sums of squares matrix, the residual covariance matrix (pooled), the residual sums of squares matrix (pooled), the univariate analyses of variance, Show misclassification rates, Save first discriminant axes (next to this option you can choose the number of coordinates to keep), Overwrite discriminant coordinates, a graph of the first two discriminant coordinates, and MST and Biplot graphs (minimum spanning trees).
The choices Between groups covariance matrix and Residual covariance matrix (pooled) allow you to visualize the covariance matrices related to the hypothesis of group effects (H) and the residual variances and covariances obtained after discounting the effect of groups (E). You can also request the matrices of sums of squares that give rise to the H and E matrices. InfoStat provides the univariate analyses of variance, i.e. F statistics constructed from the diagonal elements of the H and E matrices for each variable.
Activating Show misclassification rates, InfoStat provides the apparent error rates (estimates of the probability of misclassification) obtained by classifying the observations of the file into the groups using the newly constructed discriminant function. It is common in such problems, when enough data are available, to partition the data set into two subsets: one is used to find the discriminant function and the other to validate it. The function estimated from the first file (calibration data) can then be evaluated with the data of the second file (validation data). Apparent error rates use a single file for both processes, i.e. the same observations used to estimate the function are then reclassified with that function to estimate the classification error. Apparent error rates tend to underestimate the true error; they are useful when large sample sizes are available in each population (Johnson and Wichern, 1998). Selecting Save first discriminant axes generates in the data table as many new columns as indicated (the maximum number that can be specified equals the number of groups minus one or the number of variables, whichever is smaller).
InfoStat automatically performs Bartlett's test of the hypothesis of homogeneity of covariance matrices (Morrison, 1976); this test is useful to decide whether it is optimal to work with a linear discriminant function or whether another type of discriminant function should be used. Rejection of the null hypothesis (equal covariance matrices for each group) suggests that, when the data are normal, a quadratic discriminant function would be more appropriate than the linear discriminant function. InfoStat also automatically reports the maximum number of canonical discriminant functions (except when a smaller number is requested), the discriminant functions standardized by the common residual covariance matrix and the centroids in the discriminant space, i.e. the average values of the canonical functions for each group. These centroids are useful for the later classification of new observations into the groups. The construction of the canonical functions is performed according to the theoretical notions explained later.
Example 7: The Iris data file is used, containing 50 observations per species of 4 features of a flower: sepal length ("SepalLen"), sepal width ("SepalWid"), petal length ("PetalLen") and petal width ("PetalWid") for 3 species of Iris (Fisher, 1936); 150 observations in total. The aim is to find a discriminant function for classifying new flowers into one of the three groups (species), according to the values assumed, for each flower, by the four variables that make up the discriminant function.
After selecting DA, in the Linear discriminant analysis window, SepalLen, SepalWid, PetalLen and PetalWid were declared as Variables and Species as Grouping variable. After OK, another dialog box appears where the following were requested by default: Show misclassification rates, Save first discriminant axes (2, equal to the number of groups minus one) and Overwrite discriminant coordinates (if the file already contains columns with the canonical axes (discriminant coordinates) 1 and 2, InfoStat overwrites these columns with the axes obtained in the present analysis). Then Graph was chosen. After Go, the following results are obtained:
Table 53: Linear discriminant analysis. File Iris.
Discriminant analysis
Linear discriminant analysis

Equal within-group covariance matrix test
Groups    N     Statistic   df    p-value
3         150   439.17      20    <0.0001

Eigenvalues of Inv(E)H
Eigenvalues    %       Cumulative %
32.19          99.12   99.12
0.29           0.88    100.00

Canonical discriminant functions
            1       2
Constant   -2.11   -6.66
SepalLen   -0.83    0.02
SepalWid   -1.53    2.16
PetalLen    2.20   -0.93
PetalWid    2.81    2.84

Canonical discriminant functions - data standardized by within variances
            1       2
SepalLen   -0.43    0.01
SepalWid   -0.52    0.74
PetalLen    0.95   -0.40
PetalWid    0.58    0.58

Canonical scores of group means
Group   Axis 1   Axis 2
1       -7.61     0.22
2        1.83    -0.73
3        5.78     0.51

Classification table
Group    1     2     3     Total   Error(%)
1        50    0     0     50      0.00
2        0     48    2     50      4.00
3        0     1     49    50      2.00
Total    50    49    51    150     2.00

The test of homogeneity of covariance matrices showed a p-value <0.0001, suggesting that this assumption is not fulfilled and that a quadratic discriminant function could be better. Nevertheless, the analysis was continued because this data set has been widely used in the literature to illustrate the results of linear DA. From the eigenvalues of inv(E)H we can conclude that canonical axis 1 explains 99.12% of the variation between groups. Since there are three groups, two discriminant functions, or two canonical axes, are produced, and the value of each observation on each canonical axis is added to the data table. The first canonical discriminant function can be expressed as follows:
F=-2.11-0.83(SepalLen)-1.53(SepalWid)+2.20(PetalLen)+2.81(PetalWid)

In this linear function of the four selected variables, the coefficients correspond to the weights of each variable. If the variables have very different variances and/or there is high covariance between pairs of variables, the interpretation may be misleading, so the relative importance of each variable in the discrimination of the groups should be analyzed using the function coefficients standardized by the variances and covariances. The first discriminant function standardized by the common covariances shows that PetalLen is the most important variable for discrimination on this axis. Observations (flowers) with high values for this variable (long petals) appear to the right of the scatter plot of the observations in the discriminant space (the space formed by the canonical axes), since the coefficient is positive (0.95).
The centroids in the discriminant space, or means of the functions by group, show that group 1 is opposed to the other two groups on canonical axis 1, indicating that differences in PetalLen separate the observations of group 1 (shorter petals) from those of groups 2 and 3. Differences between groups can be interpreted similarly using canonical axis 2. In this example, axis 2 explains very little variation between groups (the associated eigenvalue shows that the percentage of variance explained by this axis is 0.88%). The relative importance of the canonical axes should therefore be kept in mind.
The cross-classification table presented at the end of the output (rows represent the group to which each observation belongs and columns the group to which the observation is assigned by the discriminant function) shows that the 50 individuals of group 1 were all correctly classified, so the classification error rate in this group is 0%. Of the 50 individuals in group 2, 48 were correctly assigned and two were wrongly classified into group 3, giving an error rate of 4%. The interpretation is similar for group 3. The average apparent error rate is 2%. InfoStat automatically adds to the data table a column called "Classification", which shows that cases 71, 84 and 134 were the misclassified ones.
To display the discrimination between groups suggested by the DA, Graph was selected in the DA window. This option automatically produces a scatter diagram of canonical axis 1 versus canonical axis 2, partitioned by the classification criterion, in this case "Species". Prediction ellipses were added to the chart as follows: select the three series, press the right mouse button and choose "draw contours", which enables the contour options "Simple contour", "Prediction ellipse" and "Confidence ellipse". The three misclassified observations were also marked.

[Scatter plot of Canonical axis 1 (horizontal) versus Canonical axis 2 (vertical)]

Figure 28: Graphical representation of multivariate observations of three groups in the discriminant space of canonical axis 1 and axis 2. Contours correspond to prediction ellipses. File Iris.

[Scatter plot in the space of the canonical axes, with confidence ellipses]

Figure 29: Confidence ellipses for the centroid of multivariate observations. Vertical lines indicate the centroid of each group on canonical axis 1. File Iris.

Theoretical notions on discriminant analysis


The DA allows an algebraic description of the relationship between two or more populations (groups) such that the differences between them are maximized or become more evident. The DA is often performed for predictive purposes related to the classification, into one of the existing populations, of new observations or of observations for which the group of membership is unknown. A new observation, which was not used to build the classification rule, is assigned to the group to which it is most likely to belong based on its measured characteristics. For this assignment it is necessary to define a classification rule. The linear discriminant function can be used for this purpose. Furthermore, the DA can be used with the aim of finding the subset of variables that best explains the variability between groups.
The analysis assumes that there are n independent p-dimensional observations, which are grouped into two or more groups. The DA approach implemented in InfoStat assumes that the dependent variable is nominal (the groups) and the independent variables are metric (continuous variables, on an interval or ratio scale), i.e. one variable acts as a grouping factor and each individual or object in the data table belongs to a defined group. This grouping is known prior to the analysis. For example, if the extent of a viral attack affecting the residents of an area is being studied and three levels of attack are identified such that 0 indicates "low", 1 "medium" and 2 "high", each individual in a sample of people must belong to only one of these groups. It is also assumed that there is a number p of variables with potential to explain the classification or grouping of people according to the degree of attack experienced.
Linear discriminant analysis can be interpreted by analogy with univariate linear regression analysis. The objective of regression analysis is to predict the value of the population mean of a dependent variable from a linear combination of independent variables whose values in the individuals of a sample are known. The objective of discriminant analysis is to find a linear combination of independent variables that minimizes the probability of misclassifying individuals or objects into their respective groups. Regarding the assumptions, in regression analysis the dependent variable is assumed to be normally distributed and the independent variables are fixed; in discriminant analysis the independent variables are generally assumed to be normally distributed and the dependent (grouping) variable is fixed.
The discriminant function calculated by InfoStat is a linear combination of the original variables for which the ratio of the sum of squared differences between groups to the within-group variance is maximum. When there are two groups, a single linear discriminant equation (canonical axis) is generated. If there are k groups, there will be k-1 uncorrelated discriminant functions (canonical axes). We recommend plotting the observations in the space generated by the canonical axes for a better visualization of the differences between groups.
When the data follow a multivariate normal distribution, the analysis can be performed by parametric methods either by assuming homogeneity of the variance-covariance structure and thus working with the common covariance matrix (linear discriminant function) or by working with the covariance matrix of each group (quadratic discriminant function). Violation of the normality assumption can be addressed by transforming the variables to achieve normality or by a nonparametric DA. Simulation studies show that the linear discriminant function is robust to departures from the multivariate normality assumption. Failure to meet the assumption of homogeneity of covariance matrices when a linear function is used can increase the classification error.

The first approach to linear discrimination for k=2 groups was suggested by Fisher (1936), who addressed the problem from a univariate perspective using a linear combination of the observed features. Let $\mathbf{x}$ be the $p\times 1$ vector of characteristics measured on one element of a population and consider two populations $\pi_1$ and $\pi_2$. Call $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ the multivariate density functions associated with populations $\pi_1$ and $\pi_2$, respectively. We assume that the random variables characterized by these functions have mean vectors $\boldsymbol{\mu}_1 = E(\mathbf{x}\mid\pi_1)$ and $\boldsymbol{\mu}_2 = E(\mathbf{x}\mid\pi_2)$ and common covariance matrix $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)']$, $i=1,2$. Consider the linear combination $y = \mathbf{l}'\mathbf{x}$; then

$\mu_{1y} = E(y\mid\pi_1) = \mathbf{l}'\boldsymbol{\mu}_1$ and $\mu_{2y} = E(y\mid\pi_2) = \mathbf{l}'\boldsymbol{\mu}_2$,

and the variance of the linear combination is $V(y) = \sigma_y^2 = \mathbf{l}'\boldsymbol{\Sigma}\mathbf{l}$.

Fisher's idea was to maximize the statistical distance between $\mu_{1y}$ and $\mu_{2y}$ through a proper selection of the vector of coefficients of the linear combination, namely:

maximize $\dfrac{(\mu_{1y} - \mu_{2y})^2}{\sigma_y^2} = \dfrac{(\mathbf{l}'\boldsymbol{\mu}_1 - \mathbf{l}'\boldsymbol{\mu}_2)^2}{\mathbf{l}'\boldsymbol{\Sigma}\mathbf{l}}$

The solution to this maximization is $\mathbf{l} = c\,\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, $c \neq 0$. The linear combination of the observation vector and the vector $\mathbf{l}$ is known as the Fisher linear discriminant function, $y = \mathbf{l}'\mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x}$.

To classify a new observation $\mathbf{x}_0$ using the Fisher linear discriminant function, its "score" is obtained as $y_0 = \mathbf{l}'\mathbf{x}_0 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x}_0$. Defining

$m = \tfrac{1}{2}(\mu_{1y} + \mu_{2y}) = \tfrac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)$

as the midpoint between the univariate means of $y$, we have $E(y_0\mid\pi_1) - m \geq 0$ and $E(y_0\mid\pi_2) - m < 0$. The classification rule is then to assign $\mathbf{x}_0$ to population $\pi_1$ if $y_0 \geq m$ and to population $\pi_2$ if $y_0 < m$. That is, if the new observation is closest to the centroid of group 1 in the discriminant space it is assigned to group 1; otherwise it is classified in group 2.
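A minimal numerical sketch of this two-group rule, using sample estimates in place of the population quantities (names are illustrative, not InfoStat code):

import numpy as np

def fisher_rule(X1, X2):
    """Two-group Fisher rule: coefficient vector l and midpoint m from sample estimates."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled estimate of the common covariance matrix
    S = ((n1 - 1) * np.cov(X1, rowvar=False) + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    l = np.linalg.solve(S, m1 - m2)      # l proportional to S^{-1}(m1 - m2)
    m = 0.5 * l @ (m1 + m2)              # midpoint between the two group scores
    return l, m

def classify(x0, l, m):
    """Assign x0 to group 1 if its score l'x0 is at least m, otherwise to group 2."""
    return 1 if l @ x0 >= m else 2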


When more than two groups or populations describe the structure of the observations, the Fisher method is generalized under the name of canonical discriminant analysis. Let

$\mathbf{H} = \sum_{i=1}^{g}(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})'$, where $\bar{\mathbf{x}} = \frac{1}{g}\sum_{i=1}^{g}\bar{\mathbf{x}}_i$,

be the matrix of sums of squares and cross products between groups (the SSCP matrix associated with the hypothesis H on the effect of groups), and define the common SSCP matrix of the error terms as

$\mathbf{E} = \sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)' = (n-g)\,\mathbf{S}$.

The eigenvectors of $\mathbf{E}^{-1}\mathbf{H}$ are the canonical discriminant functions that separate the groups. The classification rule in this case suggests assigning $\mathbf{x}_0$ to the group whose mean is closest, in terms of statistical distance, to $\mathbf{x}_0$. Thus, $\mathbf{x}_0$ is assigned to group $k$ if

$\sum_{j=1}^{r}\left[\mathbf{l}_j'(\mathbf{x}_0 - \bar{\mathbf{x}}_k)\right]^2 \leq \sum_{j=1}^{r}\left[\mathbf{l}_j'(\mathbf{x}_0 - \bar{\mathbf{x}}_i)\right]^2$ for all $i \neq k$ and $r \leq s = \min(g-1, p)$.
The first canonical axis (associated with the greatest eigenvalue $\lambda_1$ of $\mathbf{E}^{-1}\mathbf{H}$) allows visualization of the maximum separation between groups. In practice, only the first axes may be necessary to explain the separation between groups. The eigenvalues provide a measure of the separation between groups in the directions given by the corresponding eigenvectors (the coefficients of the linear combinations). If $s$ is the maximum number of canonical axes that can be obtained, the value

$\dfrac{\lambda_1 + \ldots + \lambda_r}{\lambda_1 + \ldots + \lambda_s}$

is the proportion of the variation among groups explained by the first $r$ canonical axes. When only two or three canonical variables adequately describe the separation between groups, the observations are usually plotted in the space defined by these axes to reduce the dimension of the representation. The DA can then be thought of as a dimension reduction technique related to PCA: in linear DA the canonical axes or canonical variables are obtained from linear combinations of the quantitative variables such that the combinations explain the variation between classes or groups, in the same way that the linear combinations that are the principal components in PCA explain the total variation.
A useful way to obtain a measure of the importance of a response variable on a canonical variable (or canonical axis) is through the standardization of the coefficients of the corresponding linear combination. If $\mathbf{D} = [\mathrm{Diag}(\mathbf{E})]^{1/2}$ is the diagonal matrix of standard deviations of the original variables, then $\mathbf{l}_s = \mathbf{D}\,\mathbf{l}$ is the vector of standardized canonical coefficients derived from $\mathbf{l}$. These coefficients are useful for judging the contribution of each original variable to the explanation of the variability between groups.
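A compact sketch of these computations (hypothetical names; sample-based and not InfoStat's routine) that obtains canonical discriminant axes from the eigen decomposition of E^(-1)H:

import numpy as np

def canonical_discriminant(X, groups, r=2):
    """Canonical discriminant axes from the eigenvectors of inv(E) H."""
    labels = np.unique(groups)
    means = {k: X[groups == k].mean(axis=0) for k in labels}
    grand_mean = np.mean(list(means.values()), axis=0)   # unweighted mean of group means
    p = X.shape[1]
    H, E = np.zeros((p, p)), np.zeros((p, p))
    for k in labels:
        d = (means[k] - grand_mean).reshape(-1, 1)
        H += d @ d.T                                     # between-group SSCP
        R = X[groups == k] - means[k]
        E += R.T @ R                                     # within-group (error) SSCP
    eigval, eigvec = np.linalg.eig(np.linalg.solve(E, H))
    order = np.argsort(eigval.real)[::-1][:r]
    L = eigvec[:, order].real                            # coefficients of the canonical functions
    return X @ L, eigval.real[order]                     # canonical scores and eigenvalues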

Canonical correlations
Menu STATISTICS MULTIVARIATE ANALYSIS CANONICAL
CORRELATIONS allows the calculation of canonical correlations (correlations between groups of variables) and tests of their statistical significance.
InfoStat automatically produces a series of hypothesis tests, each stating that a given canonical correlation and all smaller ones are zero in the population (Johnson and Wichern, 1998). InfoStat uses the usual approach based on a chi-square statistic; for the probability levels to be valid, it is important that at least one of the two sets has an approximately multivariate normal distribution. The output shows, for each canonical correlation that can be computed, the canonical correlation coefficient (R), the proportion of total variance explained by each pair of canonical variables (R2), the statistic (Lambda) to test the hypothesis that the correlation and all smaller ones are equal to zero in the population, the degrees of freedom (df) and the probability level associated with this test (p-value).
The coefficients associated with each original variable in the generation of a canonical variable can be requested standardized or not standardized. InfoStat also performs partial canonical correlations by specifying a grouping criterion at the level of the observations (not the variables). Partial canonical correlation analysis (Timm, 1975) is a multivariate generalization of the usual partial correlation analysis and is interpreted in the same way as canonical correlation analysis (CCA). The correlations are obtained from the residual covariance matrix obtained after adjusting for group effects.
InfoStat automatically adds to the data table the values assumed by each of the canonical variables (the score of each observation on each linear combination defining a canonical variable). Correlations between the original variables and the canonical variables can be requested from the CORRELATION ANALYSIS menu.
The multiple linear regression analysis between a canonical variable and all the original variables in the opposite set can be done in the LINEAR REGRESSION menu to facilitate the interpretation of the canonical correlation. Scatter plots of each canonical variable versus its counterpart in the other group are also recommended.
To perform a CCA in InfoStat, in the Variables selector of the Canonical correlations window you should identify the variables that make up the first group (variables in group 1, or dependent variables) and those that make up the second group (variables in group 2, or independent variables). After OK, another window appears in which you can choose to use Variables in their original scale (the covariance matrix is used) or Standardized variables (the correlation matrix is used).
Example 8: In a study of students in their final year of high school, the aim was to know whether the grades in quantitative subjects such as Mathematics, Physics and Accounting are correlated with the grades in subjects such as Language, Literature and History. The study analyzed the results of 6 assessments, one per subject, for each student in a random sample of students. The teachers responsible for these assessments believed that students who performed well in the quantitative areas would also do so in the non-quantitative ones. The data are in the CanCorr file.
Menu STATISTICS MULTIVARIATE ANALYSIS CANONICAL CORRELATIONS, Canonical correlations window: Mathematics, Physics and Accounting were selected as Variables for group 1 and Language, Literature and History for group 2. In the next window, Standardized variables (correlation matrix used) was chosen. The following output is generated in the Results window:
Table 54: Canonical correlations. File CanCorr.
Canonical correlation

Correlation matrix
            Literature  History  Language  Math    Physics  Accounting
Literature  1.000       0.597    0.853     0.870   0.127    0.865
History     0.597       1.000    0.778     0.768   0.226    0.566
Language    0.853       0.778    1.000     0.982   0.166    0.760
Math        0.870       0.768    0.982     1.000   0.134    0.738
Physics     0.127       0.226    0.166     0.134   1.000    0.347
Accounting  0.865       0.566    0.760     0.738   0.347    1.000

Canonical correlations
          L(1)     L(2)    L(3)
R         0.990    0.601   0.148
R2        0.980    0.361   0.022
Lambda    68.246   7.297   0.344
df        9.000    4.000   1.000
p-value   0.000    0.121   0.558

Linear combination coefficients
            L(1)     L(2)     L(3)
Literature  0.271    1.879   -0.470
History     0.036   -0.066   -1.624
Language    0.731   -1.687    1.692
Math        0.845    1.223    0.261
Physics    -0.018    0.478   -0.976
Accounting  0.202   -1.578   -0.118
Note that the first canonical correlation R is 0.990, corresponding to the correlation between the first pair of canonical variables, L(1). The value R2=0.98 indicates that 98% of the variability is explained by this correlation. The test of the hypothesis that the first canonical correlation and all the following ones are equal to zero in the population is based on the Lambda statistic with 9 degrees of freedom. In this example, the value of the statistic (68.246) is associated with a p-value less than 0.001. Hence the first canonical correlation, between scores in quantitative and non-quantitative subjects, is significantly different from zero in the population. The second canonical correlation, R=0.601, and the smaller ones are not significantly different from zero, as can be seen from the remaining p-values. In summary, one canonical correlation is sufficient to measure the association, at the level of scores, between the two types of subjects.

Theoretical notions of canonical correlation analysis


Canonical correlation analysis (CCA) (Hotelling, 1936) is used to determine the linear relationship between two groups of metric variables, one considered as dependent variables and the other as independent. That is, the analysis focuses on the study of the association between two sets or groups of variables. For example, suppose there are variables that indicate the liquidity of commercial firms and other variables that indicate the tax contribution of these same firms; CCA allows the relationship between liquidity and tax contribution to be identified and quantified, with these two characteristics not measured directly but through the variables that make up each group.
CCA provides a measure of the correlation between a linear combination of the variables in one set (in this example, a linear combination of the variables that measure liquidity) and a linear combination of the variables in the other set (a combination of the variables that measure tax contribution).
In a first step of the analysis, InfoStat determines the pair of linear combinations with maximum correlation. In a second step, it identifies the pair with maximum correlation among all pairs uncorrelated with the pair of combinations selected in the first step, and so on. The linear combinations of a pair are called canonical variables and the correlation between them is called a canonical correlation.
To interpret the canonical variables, recall that the simple correlation coefficient between two variables Y and X (the Pearson product-moment coefficient) is defined as:
$r_{12} = \mathrm{corr}(Y, X) = \dfrac{\mathrm{Cov}(Y, X)}{\sqrt{\mathrm{Var}(Y)\,\mathrm{Var}(X)}} = \dfrac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}$

Then, if $\mathbf{x}$ is a vector of q variables and $\mathbf{l}'\mathbf{x}$ is a linear combination of $\mathbf{x}$, the correlation between a dependent variable Y and the combination $\mathbf{l}'\mathbf{x}$ is:

$r_{Y,\mathbf{l}'\mathbf{x}} = \mathrm{corr}(Y, \mathbf{l}'\mathbf{x}) = \dfrac{\mathrm{Cov}(Y, \mathbf{l}'\mathbf{x})}{\sqrt{\mathrm{Var}(Y)\,\mathrm{Var}(\mathbf{l}'\mathbf{x})}}$

The vector $\mathbf{l}$ that maximizes this correlation is the one resulting from fitting a multiple regression model of Y on $\mathbf{x}$, and it can be shown that:

$r_{Y,\mathbf{l}'\mathbf{x}} = \mathrm{corr}(Y, \mathbf{l}'\mathbf{x}) = \sqrt{R^2}$

where $R^2$ is the coefficient of determination of the multiple regression of Y on the q variables in $\mathbf{x}$. Thus multiple regression analysis is a special case of CCA in which one of the sets of variables to be correlated has a single element. Simple regression analysis corresponds to another CCA case in which both sets have only one element.
Now, if $\mathbf{y}$ is a vector of p variables, $\mathbf{x}$ is a vector of q variables and $\mathbf{l}_1'\mathbf{y}$ and $\mathbf{l}_2'\mathbf{x}$ are two linear combinations, the correlation between these combinations is:

$r_{\mathbf{l}_1'\mathbf{y},\,\mathbf{l}_2'\mathbf{x}} = \mathrm{corr}(\mathbf{l}_1'\mathbf{y}, \mathbf{l}_2'\mathbf{x}) = \dfrac{\mathrm{Cov}(\mathbf{l}_1'\mathbf{y}, \mathbf{l}_2'\mathbf{x})}{\sqrt{\mathrm{Var}(\mathbf{l}_1'\mathbf{y})\,\mathrm{Var}(\mathbf{l}_2'\mathbf{x})}}$

The CCA is based on the vectors $\mathbf{l}_1$ and $\mathbf{l}_2$ obtained so that the correlation between the linear combinations of interest is maximum (canonical correlation). The coefficients of such linear combinations are found through the singular value decomposition of a matrix formed as a product of the variance-covariance matrices of both sets of variables.

If we denote by

$\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}$

the partitioned variance-covariance matrix of the vector $(\mathbf{y}', \mathbf{x}')'$, the squared canonical correlations, ordered from highest to lowest, are the eigenvalues (ordered from highest to lowest) of the matrix $\boldsymbol{\Sigma}_{11}^{-1/2}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1/2}$, and the vectors of coefficients of the linear combinations related to $\mathbf{y}$, i.e. the vectors $\mathbf{l}_1$, are derived from the eigenvectors $\mathbf{e}_1$ of that matrix as $\mathbf{l}_1' = \mathbf{e}_1'\boldsymbol{\Sigma}_{11}^{-1/2}$. The vectors of coefficients $\mathbf{l}_2$ of the linear combinations of $\mathbf{x}$ come from the eigenvectors of the decomposition of $\boldsymbol{\Sigma}_{22}^{-1/2}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1/2}$, with canonical coefficients $\mathbf{l}_2' = \mathbf{e}_2'\boldsymbol{\Sigma}_{22}^{-1/2}$. The coefficients of the linear combinations are called canonical weights. Commonly they are normalized so that each canonical variable has unit variance.

The first canonical correlation is never less than the largest of the multiple correlations between any variable and those of the opposite group. It can happen that the first canonical correlation is very high while all the multiple correlations for predicting one variable from the opposite set are small.
The CCA assumes linear relationships; other types of association may be missed and distort the analysis. The incorporation and elimination of variables can substantially change the analysis, as can the presence of influential points. The diagnostic techniques commonly used in regression analysis can be applied to identify influential points. The CCA does not require multivariate normality unless standard errors and hypothesis tests for the correlations are to be obtained.
The number of canonical correlations that can be extracted from these decompositions equals the minimum of p and q (the number of variables in each of the sets to be correlated). The squared canonical correlation coefficients represent the proportion of total variance explained by each canonical variable. The simple correlations between the response variables and the canonical variables are usually reported under the name "total canonical structure".
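The following sketch is illustrative only (hypothetical names; not InfoStat's implementation) and computes the canonical correlations from the partitioned covariance matrix as described above:

import numpy as np

def inv_sqrt(A):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def canonical_correlations(Y, X):
    """Canonical correlations between the variable sets Y (n x p) and X (n x q)."""
    p, q = Y.shape[1], X.shape[1]
    S = np.cov(np.hstack([Y, X]), rowvar=False)          # partitioned covariance matrix
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    A = inv_sqrt(S11)
    M = A @ S12 @ np.linalg.inv(S22) @ S21 @ A           # S11^(-1/2) S12 S22^(-1) S21 S11^(-1/2)
    rho2 = np.sort(np.linalg.eigvalsh(M))[::-1][:min(p, q)]   # squared canonical correlations
    return np.sqrt(np.clip(rho2, 0, 1))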

Partial Least Squares Regression


Menu STATISTICS MULTIVARIATE ANALYSIS PLS (Partial Least Squares) is a relatively new multivariate statistical method. It is a technique that generalizes and combines PCA and linear regression analysis. It is particularly useful when you want to predict a set of dependent variables (Y) from a set (relatively large and possibly correlated) of predictor variables (X). The objective of the PLS method is to describe Y from X and their structure of common variation.
When there are more observations than predictor variables and there is no multicollinearity problem, the prediction of Y from X can be done efficiently with a multiple linear regression analysis. PLS is used when the predictors are correlated and/or there are more predictors than observations. The estimation problem in these cases could be solved by combining the predictors linearly with PCA and then regressing Y on a reduced number of principal components. But remember that the principal components explain the variation in X and tell us nothing about the relationship of Y with X. In contrast, the PLS technique seeks an optimal solution, or compromise, between explaining the maximum variation in X and finding correlations between X and Y.
If we call X and Y the two sets of variables and assume that the number of variables in X is m (X1, X2, ..., Xm) and the number of variables in Y is n (Y1, Y2, ..., Yn), it is possible to construct a correlation matrix R such that its element Rij is the correlation between Xi (i=1, ..., m) and Yj (j=1, ..., n). This matrix does not generally have ones on the diagonal and is usually not square. The idea in PLS is to obtain a vector of m coefficients Ai, one for each variable in X, and a vector of n coefficients Bj, one for each variable in Y, such that the product AB' (i.e., the matrix whose ij entry is Ai*Bj) approximates the matrix R well in the least squares sense (i.e., minimizing the sum of the terms (Rij-Ai*Bj)2). We could say that these factors combine each set of variables to explain the variability due to the relationship or correlation between the two blocks.
A classic application of PLS is the extension of multiple regression when the predictors are correlated or, as indicated above, when the number of observations is small relative to the number of regressors. The implementation of PLS in InfoStat is aimed at obtaining representation spaces similar to those obtained with PCA, but involving an additional set of covariates used to explain the relationships between objects and variables that are displayed in the biplot representation of a PCA. PLS results are presented through a "tri-plot". We speak of a tri-plot when, in addition to the biplot, the covariates used to explain the association between the "row-space points" and the "column-space points" of the biplot are also plotted.
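As an informal sketch of this idea (hypothetical names; not the InfoStat routine), the rank-one least-squares approximation of the cross-correlation matrix R described above can be obtained from its singular value decomposition:

import numpy as np

def pls_rank1_weights(X, Y):
    """Vectors A (for X) and B (for Y) whose outer product best approximates R in least squares."""
    m = X.shape[1]
    # cross-correlation matrix: R[i, j] = corr(X_i, Y_j)
    R = np.corrcoef(X, Y, rowvar=False)[:m, m:]
    U, s, Vt = np.linalg.svd(R)
    A = U[:, 0] * np.sqrt(s[0])          # coefficients for the X block
    B = Vt[0, :] * np.sqrt(s[0])         # coefficients for the Y block
    return A, B                          # np.outer(A, B) minimizes sum((R - Ai*Bj)**2)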


Objectives
Discover and report the nature of the relationship of predictor variables with one or several
response variables (i.e., a set of response variables).

Data
I observations or cases are needed, described by n dependent variables (Y-block variables) and m predictors collected on the same I cases in an I×m data matrix (X-block variables). The data table in InfoStat must contain I cases and at least (m+n) columns.
Example 9: The Factors limiting soybean file is used to illustrate an application of PLS to explain the genotype-environment (GE) interaction (agricultural season 01_02) as a function of the following environmental covariates: Ra3, %MD, %pi, PrB2t and MO. This season involved 3 genotypes (A5520RG, A6040RG and DM4800RR) and 7 locations (environments): Cavanagh, Totoras, Oliveros, Maizales, Bouquet, Rueda and C. Gómez. The Y matrix in this example contains the interaction terms between the 7 locations and the 3 genotypes, and the X matrix contains the environmental covariates previously described. In the data table, the shaded block corresponds to the X matrix and the unshaded block to the Y matrix (data: Factors limiting soybean).
Menu STATISTICS MULTIVARIATE ANALYSIS PLS, Partial least squares (PLS) window: the genotypes A5520RG, A6040RG and DM4800RR were selected as Dependent variables, Location as Class variable, and Ra3, %MD, MO, %pi and PrB2t as Predictor variables. These selections allow the SVD routine for PLS to be executed on this table and the tri-plot to be obtained (the columns of Y act as dependent variables, the rows as class variables and the columns of X as predictors).
After activating the OK button, the Partial least squares (PLS) window appears, in which the options SVD, Standardize Ys, Standardize Xs, Save latent variables and Overwrite are checked by default. In this example Tri-plot and 5 roots were also selected, as shown in the attached window.
[Tri-plot: Factor 1 (36.0%) versus Factor 2 (24.1%), showing the genotypes, the locations and the environmental covariates Ra3, %Md, %pi, PrB2t and OM]

Figure 30: Tri-plot of the correlation between an interaction matrix of three genotypes and seven environments and a matrix of environmental covariates. File Factors limiting soybean.
The GE interaction is fully explained by the first two components, as shown by the eigenvalues. The scores of genotypes and environments for the study of the interaction are presented in the output; they serve to associate genotypes with environments, but not to explain this association with the X variables. The new latent variables obtained with the PLS technique are shown in the Results window (not presented here). By correlating the residual matrix of the AMMI(2) model with the environmental covariates, the covariates with the greatest "inertia" on axis 1 of the tri-plot turned out to be Ra3 and MO (the rays for these variables in the figure have a large projection on axis 1). The interactions detected in this data set are therefore, from the environmental point of view, mainly attributed to these two variables.
The relatively high Ra3 values recorded in Cavanagh and Totoras could explain the better performance of the A5520RG genotype compared with the others in these locations. MO was also relatively high in Cavanagh and very low in Totoras and Oliveros. Soil characteristics not related to MO were not important in explaining the interactions in this season. The cultivar A6040RG performed better than the other two cultivars in Rueda and Oliveros; the interaction with Rueda is negatively correlated with Ra3. The second dimension of the tri-plot is associated with the better adaptation of DM4800RR in Oliveros, which has a lower MO content than the other sites.

Multivariate analysis of variance


Menu STATISTICS MULTIVARIATE ANALYSIS MULTIVARIATE
ANALYSIS OF VARIANCE allows to test hypotheses about equality of mean vectors in
two or more populations.
When p variables are studied for each level of one or more design factors, multivariate analysis of variance (MANOVA) is used to make simultaneous inferences about the effects of the factors in the analysis model. Analysis models may involve both classification factors and covariates (continuous variables). The classification factors can be crossed or nested, and the expression on the right side of the model equation is written in InfoStat following the same guidelines established for the univariate analysis of variance (see Analysis of variance). Unlike the univariate analysis of variance, in this module you should select more than one dependent variable. For an analysis of variance involving several variables, a missing value in any of the dependent variables eliminates the complete observation.
InfoStat automatically provides four different multivariate test statistics for hypothesis testing. For each statistic, F approximations are also reported (Johnson and Wichern, 1998). InfoStat allows the definition of specific matrices to test hypotheses related to differences between groups (levels of a classification factor) for each of the dependent variables as well as for linear combinations of these variables. The multivariate general linear hypothesis is expressed as H0: CBA=0, where C allows contrasts between the rows of B (e.g., group effects) and A allows new variables to be defined from linear combinations of the columns of B. If no particular matrix is specified, InfoStat assumes that A is the identity matrix; otherwise the tests are performed for the transformed variables defined by the columns of A. In the Multivariate analysis of variance window the dependent variables and the classification variables are declared. After OK, a window appears with the following tabs: Model, Comparisons, Contrasts/linear combinations and Linear combinations of columns.
In the Model tab, under Specification of model terms, you can write the desired model; Classification variables shows the variables on the right side of the model equation declared in the previous window. In Covariates, the variables to be included as such are selected.

In the Comparisons tab, you can select a method for a posteriori comparisons. In Means to compare, the factor whose means are to be compared is chosen, or All means (for one-way classification designs the two options produce identical results). You may choose, as in the univariate ANOVA module, to present the means to compare as a list or as a matrix, and the significance level of the test. The Hotelling test and the Hotelling test corrected by the Bonferroni inequality are available for comparing means between groups. If the user does not declare a matrix of transformations of the response variables, the profile of each group will consist of all the dependent variables selected for the analysis. If an A matrix other than the identity is given, the vectors to be compared are built from the transformed variables.
In the Contrasts/linear combinations tab, the C matrix is specified: the user must select the factor whose levels are to be contrasted in Select terms. On doing this, InfoStat displays the levels of the selected factor in the Treatments window. In To specify C matrix (rows) for the hypothesis H0:CBA=0 you must enter the C matrix, i.e. the coefficients of the required contrasts or linear combinations of the selected factor levels. When contrasts are entered, InfoStat allows you to Check orthogonality if required.
In the Linear combinations of columns tab, the transpose of the A matrix of the multivariate general linear hypothesis is specified. Under Specify B matrix, the user can view the dependent variables under study to facilitate the specification of A.
For the Iris data file, the output of the multivariate analysis of variance to test the hypothesis of equal mean vectors among the three species ("treatments") is shown below. A contrast was also performed to compare species 2 with species 3. The menu STATISTICS MULTIVARIATE ANALYSIS MULTIVARIATE ANALYSIS OF VARIANCE was invoked; in the Multivariate analysis of variance window the response variables "SepalLen", "SepalWid", "PetalLen" and "PetalWid" were designated as Dependent variables and the variable Species was selected as Class variable. In the Contrasts/linear combinations tab, under Select terms, "Species" was chosen. The Treatments list shows 1, 2, 3 (the three levels of the selected factor). In To specify C matrix (rows) for the hypothesis H0:CBA=0, the row 0 1 -1 was written to compare the mean vector of species 2 with that of species 3.
Table 55: Multivariate analysis of variance.
Multivariate analysis of variance

Analysis of variance table (Wilks)
S.V.      Statistic   F        df(num)  df(den)  p
Species   0.02        199.15   8        288      <0.0001

Analysis of variance table for contrasts (Wilks)
Species     Statistic   F        df(num)  df(den)  p
Contrast1   0.25        105.31   4        144      <0.0001
Total       0.25        105.31   4        144      <0.0001

Analysis of variance table (Pillai)
S.V.      Statistic   F       df(num)  df(den)  p
Species   1.19        53.47   8        290      <0.0001

Analysis of variance table for contrasts (Pillai)
Species     Statistic   F        df(num)  df(den)  p
Contrast1   0.75        105.31   4        144      <0.0001
Total       0.75        105.31   4        144      <0.0001

Analysis of variance table (Lawley-Hotelling)
S.V.      Statistic   F        df(num)  df(den)  p
Species   32.48       580.53   8        286      <0.0001

Analysis of variance table for contrasts (Lawley-Hotelling)
Species     Statistic   F        df(num)  df(den)  p
Contrast1   2.93        105.31   4        144      <0.0001
Total       2.93        105.31   4        144      <0.0001

Analysis of variance table (Roy)
S.V.      Statistic   F         df(num)  df(den)  p
Species   32.19       1166.96   4        145      <0.0001

Analysis of variance table for contrasts (Roy)
Species     Statistic   F        df(num)  df(den)  p
Contrast1   2.93        105.31   4        144      <0.0001
Total       2.93        105.31   4        144      <0.0001

Contrasts coefficients
Species   C'(1)
1          0.00
2          1.00
3         -1.00

The first two tables correspond to the multivariate analysis of variance for testing the equality of the mean vectors of the three species and the contrast between species 2 and 3, based on the Wilks statistic (0.02 and 0.25, respectively). Also shown are the approximate F values for both tests (199.15 and 105.31), the degrees of freedom of the F statistic and the probability levels associated with each test. As can be seen, in both cases the p-value is less than 0.0001, i.e. there are significant differences between the centroids of the multivariate observations of the three species, and between the centroid of the observations of species 2 and that of species 3. The tests jointly involve parameters of the 4 dependent variables, thereby testing the equality of the groups for the 4 variables simultaneously. The three remaining sets of tables, which present the same tests based on the Pillai, Lawley-Hotelling and Roy statistics, should be interpreted in the same way. Finally, InfoStat reports the C matrix specified by the user to perform one or more contrasts (in this example C contains only one contrast).

Theoretical notions of multivariate analysis of variance


The underlying linear model can be expressed in matrix terms as follows:

$\mathbf{Y} = \mathbf{XB} + \boldsymbol{\varepsilon}$

where $\mathbf{Y}$ is an n×p matrix with n = number of observations (file cases) and p = number of variables selected as dependent, $\mathbf{X}$ is an n×k matrix with k = number of fixed parameters associated with the model for one variable, $\mathbf{B}$ is a k×p matrix containing the fixed parameters associated with the p variables and $\boldsymbol{\varepsilon}$ is an n×p matrix of error terms. It is assumed that the n rows of $\boldsymbol{\varepsilon}$ are independent and have a p-variate normal distribution with variance-covariance matrix $\boldsymbol{\Sigma}$ of dimension p×p for each case. Collecting all the error terms of the matrix $\boldsymbol{\varepsilon}$ in a vector vec($\boldsymbol{\varepsilon}$), InfoStat assumes that vec($\boldsymbol{\varepsilon}$) ~ N(0, $\boldsymbol{\Sigma}\otimes\mathbf{I}_n$), where vec() is the function that arranges the matrix elements in vector form and $\otimes$ denotes the Kronecker product; $\boldsymbol{\Sigma}$ is the variance-covariance matrix between the response variables within each observation. Thus, in MANOVA it is assumed that the error terms are independent across observations but not between variables. Multivariate normality is necessary to test hypotheses. The $\boldsymbol{\Sigma}$ matrix is estimated by

$\mathbf{S} = (\mathbf{e}'\mathbf{e})/(n-r) = ((\mathbf{Y}-\mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y}-\mathbf{X}\hat{\mathbf{B}}))/(n-r)$, where $\hat{\mathbf{B}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{Y}$,

r is the rank of the $\mathbf{X}$ matrix and $\mathbf{e}$ is the residual matrix.
If S is scaled so that its diagonal elements are ones, the remaining elements of S are called partial correlations of the dependent variables, adjusted for the variables on the right side of the model. InfoStat provides this matrix at the user's request.
Specific matrices of contrasts and linear combinations of the response variables and of the model effects allow a wide range of hypotheses to be tested. The multivariate general linear hypothesis is expressed as H0: CBA=0, where C allows contrasts between the rows of B (e.g., group effects) and A allows new variables to be defined from linear combinations of the columns of B.
Multivariate statistical tests are performed from the estimates of the matrices H and E of sums of squares and cross products associated with the hypothesis and the error, respectively. When the model has a factor with g levels, the estimate of the matrix of the hypothesis of equality of mean vectors among the groups of observations defined by the levels of this factor is

$\mathbf{H} = \sum_{i=1}^{g}(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})'$, where $\bar{\mathbf{x}} = \frac{1}{g}\sum_{i=1}^{g}\bar{\mathbf{x}}_i$,

and the residual matrix of sums of squares and cross products of the error terms is

$\mathbf{E} = \sum_{i=1}^{g}\sum_{j=1}^{n_i}(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)' = (n-r)\,\mathbf{S}$.

These matrices fulfill the same role as the numerator and denominator of the univariate F statistic, i.e. the H matrix provides estimates of the variation (and covariation) between groups and the E matrix provides estimates of the variance (and covariance) within groups. The matrices H and E are constructed from the following general expressions:

$\mathbf{H} = \mathbf{A}'(\mathbf{C}\hat{\mathbf{B}})'(\mathbf{C}(\mathbf{X}'\mathbf{X})^{-}\mathbf{C}')^{-1}(\mathbf{C}\hat{\mathbf{B}})\mathbf{A}$

$\mathbf{E} = \mathbf{A}'(\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'(\mathbf{X}'\mathbf{X})\hat{\mathbf{B}})\mathbf{A}$

InfoStat provides four multivariate test statistics, all functions of the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ (or of $(\mathbf{E}+\mathbf{H})^{-1}\mathbf{H}$) (Pillai, 1960). These are:
Wilks' Lambda = det(E)/det(H+E)
Pillai's trace = trace(H(H+E)$^{-1}$)
Lawley-Hotelling trace = trace(E$^{-1}$H)
Roy's maximum root = $\lambda_1$, the largest eigenvalue of E$^{-1}$H
Repeated measures over time (multivariate approach)


When a variable is measured on the same experimental unit at different points in time, serial measurements are obtained that are correlated because they share the same experimental-unit effect. The process of collecting information on the same unit produces a series of repeated measurements on each unit. Other factors recognized as sources of variability between experimental units or subjects (between-subject factors) may or may not exist, but such studies always recognize the time factor as one with potential to introduce variability among observations within the same unit (within-subject factor).
Repeated measurements over time can be analyzed as multivariate profiles, where the responses observed at each point in time represent the variables of the analysis. That is, each observation corresponds to a t-dimensional vector, where t is the number of points in time at which the value of the analysis variable is recorded. The multivariate repeated measures analysis allows the correlations between the serial observations to be modeled.
The classical hypotheses tested in an analysis of repeated measurements over time, with a treatment factor grouping the observations, are: 1) no treatment×time interaction, 2) no time effect and 3) no treatment or group effect. These hypotheses can be tested in InfoStat through the use of specific A and C matrices.
Using the (univariate) analysis of variance module, a split-plot model could be fitted to the data of experiments with repeated measurements. In this model the treatment factor is associated with the experimental units of larger size and the time factor with the "subplot" (Winer, 1971; Morrison, 1976). The individuals or units nested within the main factor constitute the error term for the treatment factor. This analysis is appropriate only if the variance-covariance matrix of the observations within the same experimental unit satisfies sphericity, or if the assumption of equal correlation between any pair of repeated measurements on the same individual (compound symmetry model) can be sustained. If the within-subject covariance matrices have these characteristics, they are said to satisfy the Huynh-Feldt condition (Huynh and Feldt, 1970). On the other hand, the multivariate approach to repeated measures analysis (Cole and Grizzle, 1966) assumes no particular model for the matrix, but is based on the estimation of all possible covariances between the repeated measurements. This unstructured model, not very parsimonious, should be used only when there are enough observations to estimate its parameters. In general, it is required that the number of repeated observations be less than or equal to the number of replications of the experiment. Intermediate covariance structures between compound symmetry and the unstructured model can provide solutions of practical interest for this type of measurements (mixed-model-based approach). In the 2008 version of InfoStat, a procedure for fitting mixed models was incorporated. This procedure uses the R language as a calculation engine and builds an interface with the lme and gls routines of the nlme library. A complete manual explaining this module can be found in Statistics, General and Mixed models, Tutorial.
The organization of the data table for the analysis of repeated measurements through the multivariate approach must correspond to the structure required by MANOVA. The records of the variable made at different points in time must appear in different columns. Thus, in addition to the columns associated with classification criteria and/or covariates, the table will have at least t additional columns.
Example 10: With two treatments and a variable V measured at 3 time points on three subjects per treatment, the file must have the following format:
Table 56: Data organization to perform repeated measures analysis. File RepMeasures.
Trat   Suj   V1    V2    V3
1      1     125   182   206
1      2     107   178   201
1      3     167   182   223
2      1     167   199   250
2      2     163   208   217
2      3     173   192   233

In InfoStat, the hypotheses of interest can be tested through two runs of the multivariate analysis of variance.
Run 1: For the RepMeasures data file (data from the table above), the menu STATISTICS MULTIVARIATE ANALYSIS MULTIVARIATE ANALYSIS OF VARIANCE was invoked. In the Multivariate analysis of variance window the response variables V1, V2 and V3 were designated as Dependent variables and the variable "Treatment" was selected as Class variable. In the Contrasts/linear combinations tab, under Select terms, "Treatment" was chosen. The Treatments list shows 1 and 2 (the two levels of the selected factor). In To specify C matrix (rows) for the hypothesis H0:CBA=0, "1 1" was written and the Linear comb. box was activated to indicate that a linear combination, rather than a contrast, had been entered. The number of entries must equal the number of levels of the selected factor; the Treatments window makes this number easy to remember. Finally, the Linear combinations of columns tab is activated and, in the space provided to write the A' matrix, the following two rows of contrasts are entered: 1 -1 0 and 0 1 -1. The list of active variables appears under the title Specify B matrix (columns). The A' matrix must have as many columns as variables displayed in the list; the number of rows is t-1. The following results are obtained (the interpretation is the same for the four statistics reported by InfoStat; the discussion below refers to the Wilks statistic):
Table 57: Repeated measures output (run 1). File RepMeasures.
Multivariate analysis of variance

Analysis of variance table (Wilks)
S.V.        Statistic   F      df(num)  df(den)  p
Treatment   0.83        0.31   2        3        0.7560

Analysis of variance table for linear combinations (Wilks)
Treatment    Statistic   F       df(num)  df(den)  p
Row 1 in C   0.04        41.30   2        3        0.0066
Total        0.04        41.30   2        3        0.0066

Analysis of variance table (Pillai)
S.V.        Statistic   F      df(num)  df(den)  p
Treatment   0.17        0.31   2        3        0.7560

Analysis of variance table for linear combinations (Pillai)
Treatment    Statistic   F       df(num)  df(den)  p
Row 1 in C   0.96        41.30   2        3        0.0066
Total        0.96        41.30   2        3        0.0066

Analysis of variance table (Lawley-Hotelling)
S.V.        Statistic   F      df(num)  df(den)  p
Treatment   0.20        0.31   2        3        0.7560

Analysis of variance table for linear combinations (Lawley-Hotelling)
Treatment    Statistic   F       df(num)  df(den)  p
Row 1 in C   27.53       41.30   2        3        0.0066
Total        27.53       41.30   2        3        0.0066

Analysis of variance table (Roy)
S.V.        Statistic   F      df(num)  df(den)  p
Treatment   0.20        0.31   2        3        0.7560

Analysis of variance table for linear combinations (Roy)
Treatment    Statistic   F       df(num)  df(den)  p
Row 1 in C   27.53       41.30   2        3        0.0066
Total        27.53       41.30   2        3        0.0066

C-matrix coefficients
Treatment   C'(1)
1           1.00
2           1.00

A-matrix
Defining linear combinations of columns in B
Variable   Col(1)   Col(2)
V1          1.00     0.00
V2         -1.00     1.00
V3          0.00    -1.00

The time×treatment interaction hypothesis is tested in the first table, where the source of variation is the treatment factor. Because of the linear combination of the dependent variables given through the A matrix, the Wilks statistic of 0.83, which is approximated by an F value of 0.31 with 2 and 3 degrees of freedom, tests the time×treatment interaction hypothesis. The value p=0.7560 suggests that there is no evidence to reject the hypothesis of no interaction, i.e. the differences between treatments do not change over time. The second table is associated with the hypothesis of no time effect; this hypothesis is tested from the differences in the response variable between successive time points (integrated over all treatments). The probability of Wilks statistic values more extreme than the one observed is p=0.0066, so the hypothesis of no time effect is rejected. The profiles of both treatments therefore change over time. The nature of the trend over time is not specified; it is only known that the response is not constant. At the end of the output InfoStat provides the C and A matrices specified by the user.
Run 2: This run aims to test the hypothesis of equal treatment means. It is tested by integrating the treatment responses over time through the A matrix. To do this, return to the Multivariate analysis of variance window, keeping the same selection as in the previous run. Clear the Contrasts/linear combinations tab (there should be nothing in To specify C matrix (rows) for the hypothesis H0:CBA=0). In the Linear combinations of columns tab, in the space for writing the A' matrix, enter a single row of ones (as many ones as variables listed under the title Specify B matrix; note that A is the unit vector). The results obtained are shown in Table 58. The results are identical for the four multivariate statistics because combining the columns into a single variable provides a test for which the univariate F statistic is exact. In this example, the differences between treatments are significant at the 5% level.
Table 58: Repeated measures output (run 2). File RepMeasures.

Analysis of variance table (Wilks)
S.V.        Statistic   F      df(num)  df(den)  p
Treatment   0.32        8.33   1        4        0.0447

Analysis of variance table (Pillai)
S.V.        Statistic   F      df(num)  df(den)  p
Treatment   0.68        8.33   1        4        0.0447

Analysis of variance table (Lawley-Hotelling)
S.V.        Statistic   F      df(num)  df(den)  p
Treatment   2.08        8.33   1        4        0.0447

Analysis of variance table (Roy)
S.V.        Statistic   F      df(num)  df(den)  p
Treatment   2.08        8.33   1        4        0.0447

A-matrix
Defining linear combinations of columns in B
Variable   Col(1)
V1         1.00
V2         1.00
V3         1.00

For tests involving only the main factor (between-subject factor), the univariate split-plot approximation and the multivariate approach produce the same results. The differences in probability levels occur in the tests involving the time factor (within-subject factor). The A matrix used to test the time-effect hypothesis in the previous example produces a transformation of the dependent variables into differences between the records of the variable at successive times. This transformation produces the analysis usually known as profile analysis (Johnson and Wichern, 1998). Other matrices, which do not change the overall results of these tests, may be proposed to analyze the nature of the responses over time. For example, if one level of the time factor can be seen as a control or reference reading, the time effect could be studied through an A matrix that contrasts every other time point with the reference time:

A' = | 1  -1   0 |
     | 1   0  -1 |

By examining each of the contrasts, the user can identify the points in time at which the responses differed from the control.
With t levels of time it is possible to test t-1 trends using an A matrix containing the coefficients of orthogonal polynomials. For the example above, the following matrix can be used to test whether the changes over time follow a linear and/or quadratic trend:

A' = | -1     0     1   |
     |  0.5  -1     0.5 |

Another transformation of interest compares each level (time) of the response variable with the mean of the subsequent levels. This matrix is suitable for experimental situations in which the aim is to identify the moment in time from which the response becomes stable (no longer changes). In the example in question this transformation is represented by the following matrix:

A' = | 1  -0.5  -0.5 |
     | 0   1    -1   |
To determine the point in time at which the profile of the observations shows a peak (positive or negative), an analysis of variance should be performed on the transformed variables representing the differences in response between adjacent time points.
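The following minimal sketch (in Python with NumPy, not part of InfoStat) illustrates how the A' matrices discussed above transform the columns of a small response matrix; the data values and the layout of times are invented for the example.

```python
import numpy as np

# Hypothetical responses of 4 subjects measured at 3 times (invented values)
Y = np.array([[10.0, 12.0, 15.0],
              [ 9.0, 11.0, 14.0],
              [11.0, 13.0, 15.0],
              [10.0, 14.0, 16.0]])

# A' for successive differences (profile analysis): time2-time1, time3-time2
A_diff = np.array([[-1.0,  1.0,  0.0],
                   [ 0.0, -1.0,  1.0]])

# A' contrasting every time with the reference (first) time
A_ref = np.array([[-1.0, 1.0, 0.0],
                  [-1.0, 0.0, 1.0]])

# A' with orthogonal polynomial coefficients (linear and quadratic trends)
A_poly = np.array([[-1.0, 0.0, 1.0],
                   [ 0.5, -1.0, 0.5]])

# The transformed variables Y A' are the ones submitted to the multivariate test
for name, A in [("differences", A_diff), ("vs. reference", A_ref), ("polynomials", A_poly)]:
    Z = Y @ A.T          # each column of Z is one linear combination of the times
    print(name, "\n", Z)
```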

Distances and association matrices


Menu MULTIVARIATE ANALYSIS DISTANCE AND ASSOCIATION MATRICES allows obtaining correlations, association measures and distances for a set of selected variables. In the Variables selector you can specify the variables of interest and one or more criteria as Class variables. If there is no classification variable, select "case". After clicking OK, a dialog box appears in which you can choose between two tabs: Continuous trait and Discrete trait. In both tabs you can choose to Standardize data and to run the analysis by rows or by columns.
You can choose the statistics used to generate the distance matrices. InfoStat produces an n×n or p×p matrix depending on whether the analysis is requested by rows or by columns, respectively. The reported matrix elements are measures of distance between row points or column points obtained from the selected statistic. The distances offered are: Euclidean, Euclidean average, squared Euclidean, Manhattan, Manhattan average, Bray-Curtis, Bray-Curtis average (Canberra) and Excoffier. Additionally, you can calculate distances as functions of similarity measures. The similarity measures offered to obtain distances are: Gower, Pearson correlation, Spearman correlation, simple matching, positive matching, Jaccard, Anderberg, Rogers and Tanimoto, Ochiai, Dice, Kulczynski 1, Braun-Blanquet, Hamman, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Phi coefficient, and Yule and Kendall. When you select one of these similarities, a subwindow automatically appears on the right to select the function to be used to transform the similarity into a distance. All are functions of S, where S is the selected similarity.

Theoretical notions about distance measurements and association


A metric measure of distance between two points, say P and Q, satisfies the following requirements:
d(P,Q) = d(Q,P), the distance is symmetric,
d(P,Q) > 0 if P ≠ Q,
d(P,Q) = 0 if P = Q, and
d(P,Q) ≤ d(P,R) + d(R,Q), the triangle inequality.

Distance measures can be converted into measures of similarity between observations. In similarity measures, as opposed to distances, the value obtained is greater the closer the elements considered are. Just as distance measures must satisfy certain properties, similarity measures must satisfy:
0 ≤ s(P,Q) ≤ 1
s(P,P) = 1
s(P,Q) = s(Q,P)

The Euclidean distance, or straight-line distance, from a p-dimensional observation x to the origin O is
d(O, x) = L(x) = sqrt(x1^2 + x2^2 + ... + xp^2)
and represents a generalization of the Pythagorean Theorem, which gives the distance between a point and the origin in the plane,
d(O, x) = sqrt(x1^2 + x2^2)

The Euclidean distance between two arbitrary points in p-dimensional space is the square root of the sum of the p squared differences between the values assumed by each variable in the pair of observations in question,
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xp - yp)^2)
This distance varies with the scale of the coordinates, so it can be completely distorted simply by changing the unit of measurement of the study variables. As an example, see the following table:
Table 59: Example with weight and height of three persons.
Person    Weight (pounds)    Height (feet)
A              160                5.5
B              163                6.2
C              165                6.0

The Euclidean distances are: dAB = 3.08; dAC = 5.02; dBC = 2.01. However, if height is measured in inches, the distances are: dAB = 8.92; dAC = 7.81; dBC = 3.12. In the latter case person A is closer to C than to B, while in the former case the opposite happens. The lack of scale invariance suggests standardizing the data (dividing by the standard deviation) before calculating the Euclidean distances. In this way, the relative distances (the ranking of the distances) do not depend on the measurement units.
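As an illustration outside InfoStat, the following Python sketch (using NumPy and SciPy, which are assumptions of this example) reproduces the calculation with the values of Table 59 and shows the effect of standardizing before computing Euclidean distances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Weight (pounds) and height (feet) of persons A, B and C, as in Table 59
X = np.array([[160.0, 5.5],
              [163.0, 6.2],
              [165.0, 6.0]])

print(squareform(pdist(X, metric='euclidean')))        # distances in the original units

X_inches = X.copy()
X_inches[:, 1] *= 12                                    # height converted to inches
print(squareform(pdist(X_inches, metric='euclidean')))  # the ranking of distances changes

# Standardizing each variable removes the dependence on the measurement unit
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(squareform(pdist(Z, metric='euclidean')))
```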
The Manhattan distance is the sum of the absolute values of the differences between each pair of coordinates that define the p-dimensional observations. The Manhattan metric is often used with ordinal and interval-scale data, while the Euclidean metric is used with continuous data. The absence of observations in one or more variables creates problems when calculating distances. In practice, multivariate observations with some missing values tend to be discarded completely. If they are kept in the analysis, average distances can be calculated by dividing the distance obtained between two observations by the number of variables with realizations in both observations. In this way, the relative comparison of distance magnitudes is not biased when the distances have been calculated using different numbers of coordinate pairs. For this purpose, InfoStat provides the Euclidean average and Manhattan average distances. When there are non-metric variables, or mixtures of metric and non-metric variables, the classical geometric distances presented above are not recommended for measuring the proximity of objects. In such situations, distance measures that do not require compliance with the triangle inequality should be used (Spath, 1980).
In many applications, each sampling unit is measured for the presence or absence of p features, and the similarities between units or observations must be built from this presence/absence information. Various measures of similarity for dichotomous or binary data are derived from a contingency table indicating presence or absence in the two units. Suppose that, from the features observed in each of two subjects or sampling units, the following table is constructed:
Table 60: Organization of binary data.
                            Subject 1
                      Presence    Absence
Subject 2  Presence       a           b
           Absence        c           d
Then, a represents the number of features present in both subjects in the sample, d is the number of features absent in both, and b and c are the numbers of features present in one subject but absent in the other. Put another way, a and d represent positive (1,1) or negative (0,0) matches, or associated pairs, while b and c represent non-associated pairs. The distances based on the Jaccard index, the simple matching coefficient, the positive matching coefficient, the Rogers and Tanimoto coefficient and the Anderberg coefficient, among others, are obtained from measures of similarity built from this information.
There are several measures of association, whose purpose is to weigh the joint presence or the joint absence of features as appropriate in each case. In some circumstances joint absence, for example, may not imply similarity between two observations. In ecological studies of vegetation, the presence of two rare plants in two sample plots may imply that both plots are similar, while their joint absence from two other places probably says nothing about the similarity between them. The major differences between these coefficients are due to: 1) whether negative associations are incorporated into the measure, 2) whether associated pairs have equal weight to non-associated pairs, and 3) whether non-associated pairs have equal weight to associated pairs. The following table shows the calculation of the various coefficients of association or similarity provided by InfoStat:
Table 61: Similarity coefficients.
1. Simple matching        (a+d)/(a+b+c+d)
2. Positive matching      a/(a+b+c+d)
3. Jaccard                a/(a+b+c)
4. Anderberg              a/(a+2(b+c))
5. Rogers and Tanimoto    (a+d)/(a+d+2(b+c))
6. Ochiai                 a/sqrt((a+b)(a+c))
7. Dice                   2a/(2a+b+c)
8. Kulczynski 1           a/(b+c)
9. Kulczynski 2           0.5*(a/(a+c) + a/(a+b))
10. Braun-Blanquet        a/max[(a+b),(a+c)]
11. Hamman                ((a+d)-(b+c))/(a+b+c+d)
12. Sokal and Sneath 1    (a+d)/(a+d+0.5(b+c))
13. Sokal and Sneath 2    0.25*(a/(a+b) + a/(a+c) + d/(d+b) + d/(d+c))
14. Sokal and Sneath 3    (a.d)/sqrt((a+b)(a+c)(d+b)(d+c))
15. Phi coefficient       ((a.d)-(c.b))/sqrt((a+b)(a+c)(b+d)(c+d))
16. Yule and Kendall      ((a.d)-(b.c))/((a.d)+(b.c))

The simple matching measure weights the matches recorded by a in the same manner as those recorded by d.
The positive matching measure is useful when simultaneous presence is more important than simultaneous absence for quantifying similarity.
The Jaccard index is useful when the emphasis is on the possession of attributes, i.e. situations (1,1), (1,0) and (0,1).
The Anderberg index is characterized by giving more weight to the non-associated pairs (1,0) and (0,1).
The Rogers and Tanimoto coefficient gives double weight to the non-associated pairs, but considers the (0,0) pairs in its calculation.
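The following hedged sketch (Python/NumPy; the presence/absence vectors are invented) shows how the counts a, b, c and d of Table 60 are obtained from two binary vectors and how some of the coefficients of Table 61 are computed from them.

```python
import numpy as np

# Two hypothetical presence/absence vectors (e.g., species in two plots); invented data
u = np.array([1, 1, 0, 1, 0, 0, 1, 0])
v = np.array([1, 0, 0, 1, 0, 1, 1, 0])

a = int(np.sum((u == 1) & (v == 1)))   # joint presences
d = int(np.sum((u == 0) & (v == 0)))   # joint absences
b = int(np.sum((u == 1) & (v == 0)))   # present only in the first unit
c = int(np.sum((u == 0) & (v == 1)))   # present only in the second unit
n = a + b + c + d

simple_matching = (a + d) / n
jaccard         = a / (a + b + c)
dice            = 2 * a / (2 * a + b + c)
rogers_tanimoto = (a + d) / (a + d + 2 * (b + c))

print(simple_matching, jaccard, dice, rogers_tanimoto)
```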
When working with ordinal variables, or when the interest is in grouping variables rather than observations, similarity measures based on sample correlation coefficients (Pearson and Spearman) are recommended. Both correlation coefficients take values between -1 and 1. The sign indicates the direction of the correlation and the absolute value measures its strength. The Spearman coefficient can be viewed as a non-parametric version of the Pearson coefficient, since the data are transformed into ranks before calculating the correlation (see Correlation). The non-parametric Kendall Tau coefficient also works with the ranks of the ordered variables and likewise takes values between -1 and 1. InfoStat can apply the following distance measures based on a similarity S: 1-S, sqrt(1-S), -log(S), 1/S-1, sqrt(2(1-S)), 1-(S+1)/2, 1-abs(S) and 1/(S+1), where S is the similarity from which a distance is to be obtained, sqrt is the square root, abs is the absolute value function and log is the logarithm (see Table 62). Moreover, it is worth noting that the correlation coefficient is related to the Chi-square statistic (r^2 = χ^2/n) used for testing the independence of two categorical variables. For fixed n, a high correlation (similarity) is consistent with a lack of independence (Johnson and Wichern, 1998). InfoStat also provides the Chi-square distance, which is obtained from the classical χ^2 statistic for contingency tables as a measure of the distance of each cell with respect to its expected value.

Table 62: Functions to obtain distance measures from a similarity index.
     Function                    Range of Sij    Range of dij
1.   dij = 1 - Sij                  [0,1]           [0,1]
2.   dij = sqrt(1 - Sij)            [0,1]           [0,1]
3.   dij = -log(Sij)                (0,1]           [0,∞)
4.   dij = 1/Sij - 1                (0,1]           [0,∞)
5.   dij = sqrt(2(1 - Sij))         [0,1]           [0,√2]
6.   dij = 1 - (Sij + 1)/2          [-1,1]          [0,1]
7.   dij = 1 - abs(Sij)             [-1,1]          [0,1]
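A minimal sketch of the similarity-to-distance transformations of Table 62, written in Python/NumPy with an invented correlation matrix playing the role of the similarity S:

```python
import numpy as np

# Pearson correlations among four variables (a similarity in [-1, 1]); invented data
X = np.random.default_rng(0).normal(size=(30, 4))
S = np.corrcoef(X, rowvar=False)

# Two of the transformations listed above
D1 = 1 - (S + 1) / 2      # maps similarities in [-1, 1] to distances in [0, 1]
D2 = 1 - np.abs(S)        # treats strong negative correlation as "close" as well

print(np.round(D1, 2))
print(np.round(D2, 2))
```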

Correspondence Analysis
Menu MULTIVARIATE ANALYSIS CORRESPONDENCE ANALYSIS allows simple and multiple correspondence analysis on data tables containing categorized variables.
In the Variables selector you can specify the Class variables and, if necessary, Frequencies. After clicking OK, a dialog box appears where you can choose the following options: BURT Matrix, Row profiles, Column profiles, Total relative frequencies, Expected frequencies for the chi-square statistic, Deviations from the expected value under independence, Individual contribution to the Chi-square statistic, Singular values, Row coordinates, Column coordinates, and Biplot. There are also the options Relative frequencies as percentages and Extract 2 axes, both of which are modifiable.
Example 11: A simple CA was carried out for a study addressing the characterization of women with alcohol-related problems based on demographic and psychological characteristics. A set of categorized variables was recorded, such as age, occupation, marital status, reason for visit, and diagnosis of the patient on entering rehabilitation. Simple CA was used to study the association between age and reason for consultation. The data (courtesy of Yolanda and Linda Diosque Prados, Department of Psychology, UNC) are found in the Alcoholism file.
We present the biplot obtained by performing the simple CA on the variables "reason" and "age". The categories for the variable reason for consultation were: C-Far (drug use), C-Sus (use of substances that generate addiction), C-Der (referred from other clinics), C-Des (desire to stop drinking), C-Alc (alcohol), C-EsA (mood), C-Vio (domestic violence), C-Fis (physical symptoms). For the variable "age" the categories were: Young (under 30), Adult (30 to 50 years) and Old (more than 50 years). To obtain the biplot the following commands were used:
Menu MULTIVARIATE ANALYSIS CORRESPONDENCE ANALYSIS. In Class variables, "reason" and "age" were chosen. In the next window the default options were kept: Individual contribution to the Chi-square statistic, Singular values, Row coordinates, Column coordinates, Biplot, and Extract 2 axes.
Figure 31: Biplot of the first two axes (Axis 1 vs. Axis 2) for the categories of "reason" and "age". File Alcoholism.


The figure displays the first two dimensions of simple AC contingency table for the crossing
of the variables "age" and "reason". The figure suggests, in its first axis (with an inertia of
73.99%) that girls (under 30 years) consulted mostly by substance use (C-Sus) and those
over 50 years consulted by desires to stop drinking (C-Des) and physical symptoms (C-Fis).
Middle-aged women cited alcohol consumption, mood, psychotropic use and derivation as
the main complaints. The points represent forms of the same variable response may be
automatically connected in InfoStat. In the Output window, got a table like this:
Table 63: Simple correspondence analysis. File Alcoholism.
Correspondence analysis

Absolute frequencies
In columns: Reason
In rows: Age
         C-Alc  C-Anim  C-Der  C-Drug  C-Fis  C-Pha  C-Smo  C-Vio  Total
Adult      19     12      17      3      1      8      5      5      70
Old         6      4       3      0      2      1      5      3      24
Young       2      2       3      5      0      1      0      2      15
Total      27     18      23      8      3     10     10     10     109

Row profiles (percents)
In columns: Reason
In rows: Age
         C-Alc  C-Anim  C-Der  C-Drug  C-Fis  C-Pha  C-Smo  C-Vio   Total
Adult    27.14  17.14   24.29   4.29    1.43  11.43   7.14   7.14  100.00
Old      25.00  16.67   12.50   0.00    8.33   4.17  20.83  12.50  100.00
Young    13.33  13.33   20.00  33.33    0.00   6.67   0.00  13.33  100.00
Total    24.77  16.51   21.10   7.34    2.75   9.17   9.17   9.17  100.00

Cell contribution to chi-square statistic
In columns: Reason
In rows: Age
          C-Alc    C-Anim   C-Der  C-Drug  C-Fis  C-Pha  C-Smo  C-Vio  Total
Adult      0.16     0.02     0.34    0.89   0.45   0.39   0.31   0.31   2.86
Old      5.1E-04  3.4E-04    0.84    1.76   2.72   0.66   3.56   0.29   9.82
Young      0.79     0.09     0.01   13.81   0.41   0.10   1.38   0.28  16.88
Total      0.95     0.11     1.19   16.46   3.57   1.15   5.25   0.89  29.56

Contribution to chi-square
    Eigenvalue  Inertia  Chi-square    (%)   Cumulative %
1      0.45       0.20      21.87    73.99       73.99
2      0.27       0.07       7.69    26.01      100.00

Theoretical notions about correspondence analysis


Correspondence analysis (CA) is an exploratory technique for graphically representing the rows and columns of a contingency table (Greenacre, 1984, 1988, 1994; Lebart et al., 1984). In Psychology this technique is often referred to as dual scaling; in Ecology it has been widely used for discrete vegetation data (presence or absence of a number of species observed in each plot along an environmental gradient). The CA technique is also a tool of prime importance for the analysis of textual data, for which contingency tables are constructed relating the use of several words across different texts. CA can be interpreted as a technique that is complementary to, and sometimes used in addition to, log-linear models for the analytical study of the relationships contained in contingency tables. CA explores these relationships graphically.
CA represents the rows and columns of a two-way table of categorized variables as points in a low-dimensional Euclidean space (usually two-dimensional). Its purpose is similar to that of principal component analysis for continuous data, differing from it in that CA operates on the matrix of chi-square deviations instead of using a covariance matrix.
The rows of the contingency table can be seen as points with coordinates given by the columns of the table. Row profiles are built by dividing the observed frequency in each cell by the corresponding row total. Each row point is assigned a weight obtained by dividing the row total by the grand total of the table. Column profiles are defined analogously. Through the singular value decomposition of the matrix of chi-square deviations of the row and column proportions from the values expected under independence between rows and columns, CA determines an optimal subspace for the representation of the row and column profiles weighted by their respective weights.
When CA is carried out on a single two-way table it is called Simple Correspondence Analysis (SCA). This analysis plots bivariate observations in a plane and identifies the strongest associations between the categories of the two qualitative variables. Multiple Correspondence Analysis (MCA) explores multi-way tables. Multivariate observations are plotted on maps in order to identify the strongest associations among the categories of several qualitative variables. This last approach uses the Burt table, which contains the levels or modalities of each categorized variable in both the rows and the columns and therefore contains all two-way cross-classifications of the original variables (Greenacre, 1984).
CA operates on a matrix of chi-square deviations, instead of using the covariance matrix as principal component analysis does. The method measures which combinations of modalities have the most inertia, i.e. which contribute most to rejecting the hypothesis of independence between the two variables; these are the modalities on the periphery of the plot, far from the center of the plane. Since the analysis is done not on the absolute frequencies but on the proportions of the contingency table, the term inertia is commonly used to denote the chi-square information in the table (the inertia is the chi-square value divided by the grand total of the table).
From the matrix of deviations per cell, a set of eigenvectors and eigenvalues is obtained and used to construct an optimal subspace for the representation of the row and column profiles weighted by their respective weights. The axes are ordered according to the chi-square deviation explained by each one. The first principal axis is associated with the highest contribution to the chi-square statistic of the contingency table. The first d axes define the optimal d-dimensional space, with d = min(I-1, J-1), where I = number of rows and J = number of columns. The proportion of the total inertia explained by each axis is used as a criterion for selecting the number of axes necessary for the representation.
As in principal component analysis, the results can be represented in a biplot that plots the row and column points in the same space (Greenacre and Hastie, 1987). The distances between row points reflect the discrepancy between row profiles. Row points that are very close on the graph have similar row profiles. Distances from the origin indicate the discrepancy between the row profiles and the centroid or marginal row distribution. The same kind of interpretation can be made for the column profiles. The distances between row points and column points are meaningless, but row and column points that fall in the same direction relative to the origin are positively associated, while those that fall in opposite directions are negatively associated. Directions may change if other dimensions are plotted, so it is important to perform the analysis on a space with high inertia.
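As a sketch of the computations described above (not InfoStat code), the following Python/NumPy fragment performs a simple CA by singular value decomposition of the matrix of standardized chi-square deviations; for brevity only the first four reason categories of the Alcoholism frequency table (Table 63) are used.

```python
import numpy as np

# Subset of the absolute frequencies of Table 63 (C-Alc, C-Anim, C-Der, C-Drug)
N = np.array([[19.0, 12.0, 17.0, 3.0],
              [ 6.0,  4.0,  3.0, 0.0],
              [ 2.0,  2.0,  3.0, 5.0]])
n = N.sum()
P = N / n
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses

# Standardized chi-square deviations from independence
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))

U, d, Vt = np.linalg.svd(S, full_matrices=False)
inertia = d ** 2                        # principal inertias
chi2_total = n * inertia.sum()          # total chi-square of the (sub)table

# Principal coordinates of rows and columns (the first two axes give the biplot)
F = (U * d) / np.sqrt(r)[:, None]
G = (Vt.T * d) / np.sqrt(c)[:, None]
print("inertia %:", np.round(100 * inertia / inertia.sum(), 2))
print(np.round(F[:, :2], 3))
print(np.round(G[:, :2], 3))
```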

Principal coordinates analysis


Menu MULTIVARIATE ANALYSIS PRINCIPAL COORDINATES ANALYSIS allows analyzing the interdependence between variables and finding a graphical representation of the n individuals that best reflects the distances between them. These distances can be calculated from the structure of similarities defined by a similarity matrix S. Unlike principal component analysis, which requires quantitative variables, principal coordinates analysis (a form of metric multidimensional scaling, MDS) can be carried out with any type of variable, including mixtures of variables.
In the Principal coordinates analysis window, the response variables should be indicated, as well as the classification variables if there are any (optional). In the Summary statistics tab there are options to save the coordinates obtained (axes), indicating how many coordinates are kept. If the principal coordinates are saved, new columns are added to the active table. These coordinates can later be retrieved for scatter plots of the observations. You can request the standardization of each variable before the analysis (Standardize data), the display of the distance matrix (Show distance matrix) on which the analysis is performed, and the minimum spanning tree (MST). The data can be analyzed by rows or by columns, and you can also choose between two functions often used to convert distances into similarities in the context of MDS: Mij = -0.5*Dij*Dij or Mij = 1/(1+Dij). If a classification criterion was indicated, in the Summary statistics tab InfoStat allows choosing among position measures such as the mean, median, minimum and maximum, and dispersion measures such as the variance and standard deviation (SD), as statistics to summarize the information of each variable within each set of records indexed by the criterion (optional).
Example 12: In a study aimed at examining the foods used as protein sources in the diets of people in European countries, the foods consumed were recorded. Since an initial analysis indicated that the main cause of variability in the patterns was the consumption of meat products, we want to analyze the distances between countries calculated from 4 variables related to protein sources of animal origin. The data are found in the Proteins file.
Menu STATISTICS MULTIVARIATE ANALYSIS PRINCIPAL COORDINATES ANALYSIS. In the Principal coordinates analysis window select "Beef", "Pork", "Egg" and "Milk" as Variables and "Country" as Classification variables. In the Principal coordinates analysis window, Save 2 axes or coordinates was activated. Standardize data was also activated, so that the distances (the Euclidean distance was selected, given the nature of the variables) were calculated on the standardized data matrix, and the MST option was activated. The following map, which explains 82% of the total variation, was obtained together with the MST, in order to identify the countries that are closest in consumption habits when these 4 variables are considered.

Figure 32: Projection of the multivariate observations (countries) onto the first two principal coordinates (PC 1 vs. PC 2), with the Minimum Spanning Tree (MST). File Proteins.

Theoretical notions about the principal coordinate analysis


The reduction technique called principal coordinates analysis is a form of classical or metric multidimensional scaling (MDS). Multidimensional scaling explores the similarities (or distances) between observations and displays them graphically. It is a useful technique for showing distances between data for which Euclidean measures are not appropriate, or for which, for some other reason, an alternative distance measure expressed as a function of an association index is desired.
The goal of the technique is to show the relationships between observations, represented by distances or similarities, in a plane in such a way that the actual distances are preserved as much as possible. The technique uses the distance (or similarity) matrix to build the configuration of points in the plane. MDS operates on a double-centered matrix derived from the similarity (or distance) matrix A as follows: Cij = Aij - Ai. - A.j + A.. . It then performs the spectral decomposition of the new matrix C, and the solution or principal coordinates are obtained as Z = E D^(1/2), where C = E D E' represents the spectral decomposition of C. The eigenvalues of the decomposition are the diagonal elements of the D matrix, each one indicating the amount of variability explained in the dimension given by the corresponding eigenvector in the E matrix.
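The following minimal Python/NumPy sketch (with invented data) follows the steps just described: double centering of the transformed distance matrix and spectral decomposition to obtain the principal coordinates.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Invented data matrix: 6 observations, 4 continuous variables
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
D = squareform(pdist(X, metric='euclidean'))

# Principal coordinates: M = -0.5*D^2, double centering, spectral decomposition
M = -0.5 * D ** 2
n = M.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
C = J @ M @ J                                # Cij = Mij - Mi. - M.j + M..
eigval, eigvec = np.linalg.eigh(C)
order = np.argsort(eigval)[::-1]             # largest eigenvalues first
eigval, eigvec = eigval[order], eigvec[:, order]

# Coordinates Z = E * D^(1/2) on the axes with positive eigenvalues
pos = eigval > 1e-10
Z = eigvec[:, pos] * np.sqrt(eigval[pos])
print("explained variability:", np.round(eigval[pos] / eigval[pos].sum(), 2))
print(np.round(Z[:, :2], 3))
```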

Classification-regression trees
Menu MULTIVARIATE ANALYSIS CLASSIFICATION-REGRESSION TREES allows classifying observations by means of multivariate decision trees.
In the variable selector you can specify the dependent variable and the explanatory variables. After clicking OK, a dialog box appears where you can choose the measure of heterogeneity within the nodes (H), the minimum node size required to continue partitioning (n), and the threshold of within-group heterogeneity at which to stop.
InfoStat provides two measures of heterogeneity within nodes (H): the Deviance, recommended when the dependent variable is a classification variable, and the Sum of squares of the values of the dependent variable within each node, usually selected for continuous variables.
A classification tree for the Iris data file is shown below. To obtain it select: Menu STATISTICS MULTIVARIATE ANALYSIS CLASSIFICATION-REGRESSION TREES. In the window designate "PetalLen", "PetalWid", "SepalLen" and "SepalWid" (in that order) as Regressors and "Species" as the Dependent variable. In the next dialog choose Deviance as the Heterogeneity measure of node (H), because the dependent variable is a classification variable (species), and leave the other options at their defaults to obtain the following classification tree:

Figure 33: Classification Tree. File Iris.


As can be seen, the first split is based on petal length, separating individuals with values lower than or equal to 2.45 (50 individuals) from those with values greater than 2.45 (100 individuals). The latter branch is then split based on petal width, into individuals with values lower than or equal to 1.75 (54 individuals) and those with values greater than 1.75 (46 individuals), and the process continues.

Note: In this data file a particular event occurs: the two variables petal width and petal length separate the first node in the same way, since species 1 differs from the other two species in both petal dimensions (length and width). Situations like these are resolved by InfoStat by choosing, among the variables with the same discriminating potential, the one that appears first in the list of explanatory variables.

It is recommended to enter the variables according to the F statistic obtained from a univariate analysis of variance. In this case the largest F statistic corresponds to petal length, which is why it was listed as the first regressor variable.
The results automatically include the tree and the history of node formation, including the values of the statistic used for classification and the critical points or values of the variables associated with each node.
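InfoStat builds the tree through its menus; as an external illustration, the following sketch fits a comparable classification tree to the Iris data with the scikit-learn library (an assumption of this example), using an entropy-based criterion in place of InfoStat's Deviance.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Entropy as a deviance-like heterogeneity measure; InfoStat's exact options may differ
tree = DecisionTreeClassifier(criterion='entropy', min_samples_split=20, random_state=0)
tree.fit(iris.data, iris.target)

# Text version of the fitted tree; the first split isolates one species using a petal dimension
print(export_text(tree, feature_names=list(iris.feature_names)))
```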

Theoretical notions of classification and regression trees


Models based on regression and/or classification trees are an alternative to additive linear models in regression problems and to additive logistic models in classification problems. These models are designed to capture non-additive behavior; standard linear models do not allow interactions between variables unless a multiplicative form is specified. In certain applications, especially when the group of predictors contains a mixture of numeric variables and factors, tree-based models are easier to interpret and discuss than linear models. They are called tree models because the original way of presenting the results is in the form of a binary tree. When the dependent variable is continuous, regression trees are formed; when it is a classification variable, classification trees are generated.
A classification or regression tree is a set of rules determined by a fitting procedure based on binary recursive partitioning, in which a data set is successively partitioned. This technique is related to divisive clustering. Initially all objects are considered as belonging to the same group. The group is split into two subgroups using one of the explanatory variables, in such a way that the heterogeneity at the level of the dependent variable is minimized according to the selected measure of heterogeneity. The two subgroups (nodes) formed are split again if: 1) there is sufficient heterogeneity to produce a partition of the observations, and/or 2) the size of the node is above the minimum established to continue the algorithm. The process stops when these conditions are no longer fulfilled. In each instance of splitting, the algorithm analyzes all the explanatory variables and selects, for the partition, the one that produces groups that are most homogeneous within and most heterogeneous between them.

Biplot and MST


Scatter plots are used to directly visualize either the observations or the variables; the relationships between the two are only implied. The biplot graphs proposed by Gabriel (1971, 1981) show the observations and the variables in the same graph, so that joint interpretations of their relationships can be made. The prefix "bi" in the name biplot reflects the fact that both observations and variables are represented in the same graph.
In biplots, the observations are usually plotted as points. The configuration of the points is obtained from linear combinations of the original variables. The variables are plotted as
vectors from the origin. The angles between the vectors represent the correlations between the variables.
The dimensions selected for the biplot are those that best explain the variability of the original data. To find optimal axes for plotting observations and variables in a common space, the idea is used that any n×p data matrix can be represented approximately in d dimensions as the product of two matrices, A (n×d) and B (p×d), where d is the rank of the original matrix and AB' approximates the original matrix. Because A and B have a common basis of d vectors, the rows and columns of the original matrix can be displayed on the same chart, with various optimality conditions and the possibility of interpreting the distances between points.
Biplots can be considered a dimension-reduction technique because the rows of A represent the observations in a smaller space (row points) and the columns of B' represent the variables (column points) in the same space. If the singular value decomposition of X is X = UDV', where U is n×p with orthonormal columns, V is a p×p orthogonal matrix and D is a p×p diagonal matrix, the A and B matrices can be expressed as
A = UD^α and B = VD^(1-α)
where α is usually set equal to 0, 1/2 or 1 to provide different optimality conditions in the graph (Gower and Digby, 1981). Biplots are scatter plots of the n+p vectors of A and B in the same d-dimensional space. For two-dimensional graphics, the components of U and V associated with the two highest singular values of D are commonly selected. These plots are approximations of the original matrix unless all the variability is explained by the first two axes.
In biplots the distance between symbols representing observations and symbols representing variables has no interpretation, but the directions of the symbols from the origin can be interpreted. Observations (row points) plotted in the same direction as a variable (column point) tend to have relatively high values for that variable, and low values for the variables whose column points are plotted in the opposite direction. Depending on the optimality conditions specified, the distances between the row points or between the column points can be statistically interpreted, the angles between the vectors representing the variables can be interpreted in terms of correlations between variables, and the lengths of the vectors can be proportional to the standard deviations. When the lengths of the vectors are similar, the graph suggests similar contributions of each variable to the representation.
InfoStat can obtain biplots from the data matrix and from other matrices in the context of various multivariate analysis procedures, as in the case of the biplots requested from the PCA and from the DA. When biplots are requested from the main menu of multivariate analysis, the user can modify the value of α. In other cases, InfoStat assigns the value 1/2 to the coefficient; in this case the biplot is known as a symmetric biplot. With α equal to 0 or 1 you get the best representations of the column space and of the row space, respectively.
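A minimal Python/NumPy sketch (with invented data) of the biplot construction just described, obtaining the row and column markers A = UD^α and B = VD^(1-α) from the singular value decomposition:

```python
import numpy as np

# Column-centered data matrix (invented values); alpha controls the type of biplot
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
Xc = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
alpha = 0.5                                  # 0.5 gives the symmetric biplot
A = U[:, :2] * d[:2] ** alpha                # row (observation) markers
B = Vt.T[:, :2] * d[:2] ** (1 - alpha)       # column (variable) markers

# A @ B.T approximates Xc using the two axes with the largest singular values
print(np.round(A, 2))
print(np.round(B, 2))
```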
If the MST option is checked, the user can visualize a minimum spanning tree on the requested biplot. Such trees are constructed by joining the points representing the multivariate observations projected onto a plane as the result of some dimension-reduction technique. The points are connected with straight line segments in such a way that all points are linked directly or
indirectly and there are no loops (Gower and Ross, 1969). The minimum spanning tree is the tree whose segments are connected so that the sum of the lengths of all the segments is minimal. The minimum spanning tree can be calculated from the distance matrix of the multivariate observations in the p-dimensional space in which they live, or from distance matrices in smaller spaces. When p-dimensional points (with p>2) are connected in the plane according to their distances in the original space, the minimum spanning tree can provide information on similarities between the observations in dimensions not directly represented in the plane. For example, points that appear very close in the two-dimensional space could be, in the original space, farther apart than they appear on the map. Minimum spanning trees are conceptually linked to the clustering algorithm known as single linkage, and as such they are used not only for graphical representation but also to form clusters of points. In the Graphical tools window, InfoStat also presents an MST option, linked to any plotted series, but in this case the generated tree is obtained from the distance matrix of the two-dimensional points being graphed. This option therefore connects the points only as a function of the distances that the user is viewing in the plane.

Generalized Procrustes analysis


The geometric configurations obtained through multidimensional scaling, principal coordinates or other similar techniques offer ways of representing the structure and empirical relationships of a set of elements or individuals on which several attributes have been observed simultaneously. In many cases the orientation of the scaling is arbitrary, and when several configurations have been obtained for the same sample of items, either because they were obtained at different times, or because different observers were involved, or because different techniques were used to build the ordination, a technique is required to analyze the consistency of these configurations. Procrustes analysis is used for this purpose.
Bramardi (2001) states that the word Procrustes was first used to describe the harmonization and alignment of configurations, referring to a term of Greek origin meaning "hammer" and alluding to a mythological innkeeper who stretched or cut the limbs of his guests so that they fit the bed where they lay. Initially, Procrustes analysis was used to adapt or adjust one configuration to another and represent them jointly. The adaptation of configurations was described as a transformation in which a matrix is rotated, constrained by the specifications of a matrix established as the target matrix. The transformed matrix must match the target matrix as closely as possible; this is what is known as a Procrustean transformation. The proposed method is restricted to matrices with the same number of columns and full rank, and is based on a least-squares criterion that minimizes the distances between similar points in the final configuration. Extending the approach of rotating one matrix to fit another, several matrices can be rotated towards a common centroid matrix. This is what is known as generalized Procrustes analysis. Gower (1975) describes the centroid matrix as the average or consensus configuration and includes the translation and scaling of the matrices, after their standardization, in his analysis, proposing an estimation technique that culminates in a form of analysis of variance.
The estimation technique for generalized Procrustes analysis developed by Gower achieves the harmonization of the individual configurations through a series of iterative steps of successive
transformations. The successive steps or transformations performed in a generalized Procrustes analysis include standardization, rotation, reflection and scaling of the data under two criteria: (1) the distances between individuals within each individual configuration are maintained, and (2) the sum of squares between similar points, i.e. points corresponding to the same item, and their centroid is minimized. The consensus configuration is obtained as the average of all the transformed individual configurations.
In matrix terms, if each individual matrix is represented by Xi (i = 1, 2, ..., m), with n rows and p columns, where the j-th row gives the coordinates of a point (individual) Pj(i) referred to the p axes, the scaling, rotation and translation can be expressed algebraically by the transformation
Xi → ρi Xi Hi + Ti
where the orthogonal rotation matrix Hi, the scale factor ρi and the translation matrix Ti are found in such a way as to minimize
Sr = Σ(j=1..n) Σ(i=1..m) Δ²(Pj(i), Gj)
where Δ(A,B) is the Euclidean distance between the pair of points A and B, and Gj is the centroid of the m similar points Pj(i) (i = 1, 2, ..., m).
In summary, the generalized Procrustes technique provides a method for matching configurations that involves three actions:
1. Translation (centering of the ordinations)
2. Rotation (in order to minimize the differences between ordinations)
3. Scaling (multiplying by factors to minimize differences in size).
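As an illustration of the translation-rotation-scaling idea for the simplest case of two configurations (the generalized analysis iterates these steps over m configurations and a consensus), the following Python/NumPy sketch is given; the configurations are invented.

```python
import numpy as np

def procrustes_fit(X, Y):
    """Translate, rotate/reflect and scale Y to match X by least squares (a sketch)."""
    Xc = X - X.mean(axis=0)                  # translation: center both configurations
    Yc = Y - Y.mean(axis=0)
    U, d, Vt = np.linalg.svd(Yc.T @ Xc)      # rotation H minimizing ||Xc - Yc H||
    H = U @ Vt
    rho = d.sum() / np.sum(Yc ** 2)          # least-squares scale factor
    Y_fit = rho * Yc @ H
    residual_ss = np.sum((Xc - Y_fit) ** 2)
    return Y_fit, residual_ss

# Invented configurations of the same 5 individuals seen by two "methods"
rng = np.random.default_rng(3)
X1 = rng.normal(size=(5, 2))
X2 = 2.0 * X1 @ np.array([[0.0, -1.0], [1.0, 0.0]]) + 1.5   # rotated, scaled, shifted
fitted, ss = procrustes_fit(X1, X2)
print(round(ss, 6))   # near zero: the configurations agree after the transformation
```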
Example 13: The Procrustes file contains a data set of 23 individuals for which two types of information, genetic and phenotypic, were recorded. The genetic data come from DNA markers and are binary, while the phenotypic data are continuous variables. We worked with a set of 239 molecular markers (genetic data) and two phenotypic variables (plant height and diameter at breast height, DBH). Procrustes analysis seeks to quantify the agreement between the ordination of the individuals obtained by principal coordinates analysis of the matrix of genetic distances and the ordination of the same individuals based on the phenotypic data. First, a principal coordinates analysis was requested using the 239 genetic variables and the 22 resulting axes were kept.
Menu STATISTICS MULTIVARIATE ANALYSIS GENERALIZED PROCRUSTES provides a window in which the variables for the analysis are selected, in this case the 22 principal coordinates and the two morphological variables (no dimension-reduction technique was applied to the latter, since there were only two). Then, the variables must be grouped according to the type of information they provide; in this example, the principal coordinates (from the molecular data) were assigned to one group (Group 1) and the variables containing morphological information to another group (Group 2). To perform
this assignment, the variables must be selected with the right mouse button to indicate a new group; InfoStat automatically places all the variables selected in the same group in the right-hand window, and the operation is then repeated to assign the remaining variables to another group. In the second instance, the window looks as follows:

If the MST option is selected, InfoStat will draw a minimum spanning tree on the consensus configuration.

Figure 34: MST of original configurations and consensus configuration. File Procrustes.idb.

Table 64: Generalized Procrustes analysis. File Procrustes.idb.
Analysis of variance table

By case sum of squares
Case     Consensus   Residual   Total
A1         0.026       0.023     0.049
A2         0.052       0.029     0.081
A3         0.114       0.052     0.166
A4         0.065       0.039     0.104
A5         0.069       0.040     0.109
A6         0.113       0.057     0.169
B1         0.041       0.029     0.070
B2         0.032       0.020     0.052
C          0.122       0.050     0.173
Z          0.043       0.027     0.070
N1         0.034       0.025     0.059
N2         0.025       0.023     0.048
R1         0.046       0.031     0.077
R2         0.024       0.023     0.046
R3         0.022       0.022     0.044
R4         0.108       0.055     0.163
R5         0.071       0.041     0.111
R6         0.034       0.027     0.061
R7         0.027       0.023     0.050
T1         0.053       0.034     0.088
T2         0.041       0.018     0.060
V1         0.029       0.023     0.052
V2         0.067       0.033     0.100
Total      1.257       0.743     2.000

By group sum of squares
Group     Consensus   Residual   Total
Group1      0.628       0.372     1.000
Group2      0.628       0.372     1.000
Total       1.257       0.743     2.000

The consensus between the ordination produced by the genetic data and that obtained from the morphological data is 63%; this is obtained by dividing the consensus sum of squares by the total sum of squares (1.257/2 = 0.6285).


Time Series
The Time Series module of InfoStat deals with the analysis of data observed in sequence at regular intervals of time. Two classic approaches for modeling series and forecasting their future values have been implemented: 1) deterministic smoothing techniques (see Statistics, Smoothing and adjustments) and 2) techniques based on the ARIMA time series models of Box and Jenkins (1976).
In this version, InfoStat allows the analysis of univariate time series (the realization of a stochastic process defined on the real numbers). When the series is entered as a column in the data table, InfoStat interprets that the time sequence of the data is given by the order in which they were entered (cases). The user can alternatively use an additional time column to index the observations in the column that contains the series (dates).
The proper modeling of time series, in the case of deterministic smoothing techniques, depends strongly on the user's critical judgment in choosing the parameters that govern the smoothing.
InfoStat supports series with missing data. The user can request automatic prediction of the missing data.
The size of the data file, i.e. the number of cases and the number of series that can be handled in this module, depends only on the amount of RAM in the personal computer running InfoStat.
In this module InfoStat also offers the possibility of building the graphics commonly used to represent time series, without recourse to the Graphs menu.
In addition, the user has facilities to simulate a wide range of processes that generate time series, including stationary and non-stationary, seasonal and non-seasonal stochastic processes. The simulated series are automatically stored as columns in the active data table.
In the case of the ARIMA models of Box and Jenkins, InfoStat allows the user to work on the three fundamental steps of identification, estimation and validation. For any of these strategies, forecasts of interest can be constructed quickly, together with highly

informative graphical representations. InfoStat also allows estimating the impulse response function and its confidence bands to describe the behavior of the process in response to an instantaneous shock.
Menu STATISTICS TIME SERIES gives access to the submenus described below.

Simulation and transformations


In the Tools block of submenus, InfoStat can generate time series by simulation, transform series, graph them, perform unit root tests, and obtain cross-correlation functions between two time series.
Invoking the Generate series submenu activates the time series generator window, with two tabs: ARMA (p, q) and Dater. In the ARMA (p, q) tab the parameters of the model that simulates the process generating a time series of T observations must be specified. In the Dater tab the structure that indexes the generated time series can be indicated (frequency, period, start and end of the generated series).
The models that can be simulated are the stationary ARMA (p, q), the non-stationary ARIMA (p, q, d) (see menu ARIMA, Box-Jenkins methodology), and the conditionally heteroskedastic GARCH (p, q) (see Hamilton, 1994). When the user selects the type of model to be used, the corresponding model equation and the explanation of the terms and parameters to be specified are automatically presented. The figure above shows the equation of the ARMA (p, q) model and the terms and parameters to be specified for it.
In the Name column, InfoStat suggests a name for the series to be generated, which the user can change. In the previous figure InfoStat suggests naming the series "3 series", since the active data table already used the first two columns.
In the Number of columns (k) field you must specify how many time series you want to generate. In the Number of observations (T) field you must enter the length of the series to be generated.
The p and q buttons allow specifying the orders of the autoregressive (AR) and moving average (MA) portions, respectively. Associated with each button there is a grid to enter the parameters. In the previous figure, since p equals 1 and q is zero, only the grid of parameters to be specified for the AR portion is displayed. As p and q are increased, the grid enables as many rows as parameters must be specified by the user. The remaining fields must be completed with the values of the "constant", "trend" and "variance" parameters of the generating process.
For the ARIMA (p, q, d) and GARCH (p, q) models the user should proceed in a manner similar to that described for the ARMA (p, q) model.
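Outside InfoStat, a series like the ones produced by the generator can be sketched in a few lines of Python/NumPy; the parameter names below are illustrative (not InfoStat's internal ones) and the values are arbitrary.

```python
import numpy as np

# Simulate T observations of an AR(1) process with constant, trend and variance sigma2
rng = np.random.default_rng(4)
T, phi, const, trend, sigma2 = 200, 0.7, 0.0, 0.0, 1.0

y = np.zeros(T)
eps = rng.normal(scale=np.sqrt(sigma2), size=T)
for t in range(1, T):
    y[t] = const + trend * t + phi * y[t - 1] + eps[t]

print(y[:5].round(3))
```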


Invoking the Series transformations submenu presents the variable selector, where the user must specify which column(s) of the file (series) are to be transformed. After selecting a series, the Series transformations window is presented, showing the series chosen, the transformation to apply and the name of the column that InfoStat assigns to the transformed series. This name can be changed by the user by positioning the cursor on it and editing it.

The available transformations include: lags of user-specified order, differences, integration, logarithmic transformation, and seasonal filters (Pindyck and Rubinfeld, 1999). To apply two or more differencings, the user must repeat the transformation as many times as necessary. When a seasonal filter is applied, a field is enabled to indicate the amplitude of the cycle (monthly, 1; bimonthly, 2; ...; annual, 12).
of the cycle (monthly, 1, bi, 2, ..., annual, 12).
Invoking the Graph series submenu presents the Variables selector, where the user must specify which column(s) of the file are to be graphed; optionally, the user can select a variable representing time to associate with the X axis. After clicking OK, InfoStat automatically builds the time-series scatter plot and presents the graphical tools window, from which the user can customize the graph as needed. This window lets you apply the usual graphics functions of the InfoStat Graphics menu, with a few exceptions, such as the specification of the number of points to be shown on the X axis (in time series there is usually a large number of values on the X axis, so InfoStat allows deciding which scale values will be visible through the specification of the number of observations that fall between two ticks). Figure 35 shows the graph obtained by default for a series generated by simulation from an AR(1) model with zero constant, no trend and variance one.


Figure 35: Series for an AR(1) model generated by simulation.

Unit root test


Invoking the Unit root test submenu presents the Variables selector, where the user must specify which column of the file (series) is to be tested. After clicking OK, InfoStat presents the value of several test statistics and the probability value associated with each of them. The null hypothesis postulates that the series has a unit root; the alternative postulates the absence of a unit root, i.e. the stationarity of the process. InfoStat implements three statistical tests for the unit root hypothesis: the Dickey-Fuller test (1979), the Dickey-Fuller test (1981) and the Phillips-Perron test (1988) (see Hamilton, 1994).
Dickey and Fuller noted that, when there is a single unit root, the differenced series is stationary. Symbolically,
yt = ρ yt-1 + ut,   |ρ| ≤ 1
yt - yt-1 = (ρ - 1) yt-1 + ut
Δyt = δ yt-1 + ut,   δ ≤ 0
Then, the unit root hypothesis can be written as:
H0: δ = 0
H1: δ < 0
To perform the Dickey-Fuller test, InfoStat estimates δ using ordinary least squares and generates the associated probability values from the empirical distribution of the statistic obtained by Monte Carlo simulation. This distribution depends on whether the model includes a constant, a trend, or both, so InfoStat automatically presents the test for all these situations. It is important to note that if yt ~ AR(p) with a unit root, then the differenced series Δyt ~ AR(p-1). Therefore, the augmented Dickey-Fuller test is based on the OLS estimation of the following model:
218

Time Series
p 1

y=
yt 1 + j yt j + ut
t
j =1

As with the basic Dickey-Fuller test, the p-values reported are derived from empirical distributions of the statistic obtained by Monte Carlo simulation, which depend on whether the augmented model includes a constant, a trend, or both. The test proposed by Phillips and Perron corrects the Dickey-Fuller tests for situations in which the error terms are serially correlated and/or heteroscedastic.
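As an external illustration (not InfoStat output), the following sketch applies an augmented Dickey-Fuller test with the statsmodels library (an assumption of this example) to a simulated random walk and to its first difference.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Random walk (has a unit root) and its first difference (stationary); invented simulation
rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(size=300))

for name, series in [("level", y), ("first difference", np.diff(y))]:
    stat, pvalue, *_ = adfuller(series, regression="c")   # model with a constant
    print(f"{name}: ADF = {stat:.2f}, p = {pvalue:.4f}")
```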

Cross-correlations
The Cross-correlations submenu shows the Series selector, where the user must specify the two series for which the cross-correlation function is wanted. The cross-correlation function shows the correlation between both series for different lags of the second series with respect to the first. Conceptually, the cross-correlation function is analogous to the autocorrelation function, except that the correlations are not obtained from observations of the same series but between two different series. The cross-correlation for lag k measures the magnitude of the linear correlation between the values of the first series and the values of the second series, k periods ahead. InfoStat takes as the first series the column of the current table that was selected first in the variable selector window. The resulting cross-correlation function is automatically displayed in the results window and in a graph. The Output window shows, for each lag, the correlation together with its standard error, the t statistic and the p-value of the hypothesis test of zero correlation for that lag. The Graph window also shows 95% confidence bands for the cross-correlation function. The user can change the confidence level of these intervals. The cross-correlation function between two series is often used to determine whether the second series could help predict the first.
The cross-covariance function between two time series x1,t and x2,t, t = 1, ..., T, for lag k, denoted C12(k), is estimated by InfoStat as
C12(k) = (1/(T-k)) Σ(t=1+k..T) x1,t x2,t-k,   k = -([N/4]+1), ..., -1, 0, 1, ..., [N/4]+1,
where [.] represents the integer part of the argument. By dividing C12(k) by the square root of the product of the autocovariance functions of each series at lag zero, (C11(0) C22(0))^(1/2), we obtain the sample cross-correlation coefficient,
r12(k) = C12(k) / (C11(0) C22(0))^(1/2)
Using the data file CrossCorr.idb, InfoStat generates the following output and plot of the cross-correlation function.


Table 65: Cross-correlation function between the X and Y series.

Cross-correlation function

General information
Series   N.Obs.    Mean    Var(n-1)    S.D.
X          10      14.80     6.84      2.62
Y          10      13.70     4.23      2.06

Cross-correlation function: r(Lag)
Lag    Coef    S.E.     T       p      Signif
-3     0.08    0.52    0.16   0.8728
-2     0.36    0.49    0.73   0.4687
-1     0.56    0.44    1.25   0.2145
 0     0.92    0.32    2.90   0.0052     *
 1     0.69    0.32    2.18   0.0331     *
 2     0.47    0.36    1.31   0.1955
 3     0.36    0.43    0.82   0.4156

Figure 36: Cross-correlation function r(X,Y) as a function of the lag. File CrossCorr.idb.
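The following Python/NumPy sketch computes a sample cross-correlation function for two invented series; the normalization follows the usual sample formula and may differ in detail from InfoStat's output.

```python
import numpy as np

def cross_correlation(x1, x2, max_lag):
    """Sample cross-correlation r(k) between x1 at time t and x2 at time t-k (a sketch)."""
    x1 = np.asarray(x1, float) - np.mean(x1)
    x2 = np.asarray(x2, float) - np.mean(x2)
    T = len(x1)
    denom = np.sqrt(np.sum(x1 ** 2) * np.sum(x2 ** 2))
    out = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            num = np.sum(x1[k:] * x2[:T - k])
        else:
            num = np.sum(x1[:T + k] * x2[-k:])
        out[k] = num / denom
    return out

# Invented series in which y follows x with a one-period delay
rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 0.8 * np.roll(x, 1) + 0.2 * rng.normal(size=100)
for lag, r in cross_correlation(x, y, 3).items():
    print(lag, round(r, 2))   # the largest coefficient appears at lag k = -1 here
```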

Power spectrum
Invoking the Power spectrum submenu presents the Variables selector, where the user must select the series for which the power spectrum is wanted. Spectral analysis is used in different disciplines to partition the variance of a time series according to frequency. For stochastic time series, the contributions of the different frequencies to the variance are measured in terms of the spectral density or power spectrum.
The word "spectrum" comes from optics. The red and blue colors of the electromagnetic spectrum are often used to describe the frequency distribution of a spectrum. A spectrum whose spectral density decreases with increasing frequency is called a red spectrum, by analogy with visible light, where red corresponds to long wavelengths (low frequencies). Similarly, a spectrum whose magnitude increases with frequency is called a "blue spectrum".


A "white spectrum" is one in which the spectral components have approximately the same amplitude over the whole frequency range. Thus, time series with long periods of variability tend to have red spectra, as in the case of long-term economic cycles, while white spectra usually appear in the measurement errors of laboratory instruments.
InfoStat estimates the power spectrum using a non-parametric spectral method based on the Fourier transform of the autocovariance function of the series, given by
Gk = Σ(m=0..M) Cyy(m) e^(-i2πkm/M),   k = 0, ..., T/2,
where Cyy(m) = (1/T) Σ(t=1..T-m) yt yt+m is a (biased) estimator of the autocovariance function, computed for a total of M lags. This biased estimator is used because it matches the spectral density estimated by the fast Fourier transform (FFT) of the original series (Emery and Thompson, 1997). Since Cyy is an even function, the spectrum is estimated in InfoStat through the cosine transformation
Gk = Cyy(0) + 2 Σ(m=1..M) Cyy(m) cos(2πkm/T),   k = 0, ..., T/2,
where Gk is concentrated on the positive frequencies fk = k/T, and the interval 0 ≤ fk ≤ fN = 1/2 (the Nyquist frequency) is divided into T/2 segments (T even).

The Especpote.idb file contains monthly average sea-surface temperatures (degrees Celsius) at a point with coordinates 55° 16.48' N, 125° 32.17', recorded from January 1982 to December 1984. The power spectrum estimated with InfoStat is shown in the following table:
Table 66: Power spectrum analysis.

Power spectrum

General information
Series   N.Obs.    Mean    Var(n-1)    S.D.
SST        36     10.811    4.274     2.067

Power spectrum
Frequency     Coef
0.000         0.000
0.028         5.292
0.056         1.866
0.083        62.934
0.111         0.455
0.139         0.002
0.167         2.750
0.194         0.030
0.222         0.151
0.250         0.103
0.278         0.072
0.306         0.103
0.333         0.267
0.361         0.302
0.389         0.031
0.417         0.016
0.444         0.045
0.472         0.177
0.500         0.401

The graph of the power spectrum of this problem is:

Figure 37: Power spectrum representation for SST series


The table and the graph show a peak in the spectrum centered on the annual frequency, since on the monthly scale of the graph this corresponds to a frequency of 0.083 cycles per month (0.083 x 12 = 1 cycle per year).
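As a sketch of the estimation just described (outside InfoStat), the following Python/NumPy fragment obtains the spectrum through the cosine transform of the biased autocovariances for an invented monthly series with an annual cycle.

```python
import numpy as np

def power_spectrum(y):
    """Spectrum via the cosine transform of the biased autocovariances (a sketch;
    InfoStat's scaling may differ)."""
    y = np.asarray(y, float) - np.mean(y)
    T = len(y)
    M = T - 1
    C = np.array([np.sum(y[: T - m] * y[m:]) / T for m in range(M + 1)])  # biased Cyy(m)
    freqs = np.arange(T // 2 + 1) / T
    G = np.array([C[0] + 2 * np.sum(C[1:] * np.cos(2 * np.pi * k * np.arange(1, M + 1) / T))
                  for k in range(T // 2 + 1)])
    return freqs, G

# Invented monthly series with a 12-month cycle, as in the SST example
t = np.arange(36)
rng = np.random.default_rng(7)
y = 10 + 2 * np.cos(2 * np.pi * t / 12) + 0.3 * rng.normal(size=36)
freqs, G = power_spectrum(y)
print(freqs[np.argmax(G)])   # close to 0.083 cycles per month (1 cycle per year)
```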

Box and Jenkins methodology (ARIMA)


InfoStat can apply the methodology proposed by Box and Jenkins (1976) to identify, estimate and validate autoregressive integrated moving average (ARIMA) models. An ARIMA model is an algebraic expression that defines how the observations of a variable at a given moment in time are statistically related to observations of the same variable recorded in the past. Building an ARIMA model requires working with enough information. As in any modeling process, a good model is the most parsimonious one (lowest order) that shows a good fit to the data.
The process yt has an ARIMA(p, q, d) representation if its d-th order difference, Δ^d yt, has an ARMA(p, q) representation. Moreover, yt has an ARMA(p, q) representation if
yt = (θ(L)/φ(L)) εt
where φ(L) is an AR polynomial of order p, θ(L) is an MA polynomial of order q and εt is the error term at time t (Box and Jenkins, 1976). If q = 0 we say that yt ~ AR(p), and if p = 0 that yt ~ MA(q).


For example, if the stochastic process {y_t} has an AR(1), MA(1) or ARMA(1,1) representation, it can be written, respectively, as:

(a)  y_t = \phi\, y_{t-1} + \varepsilon_t,
(b)  y_t = \varepsilon_t + \theta\, \varepsilon_{t-1},
(c)  y_t = \phi\, y_{t-1} + \varepsilon_t + \theta\, \varepsilon_{t-1},

where the sequence {\varepsilon_t} is a martingale difference sequence (MDS), conditionally homoscedastic with unconditional variance \sigma^2 (Hamilton, 1994).
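A minimal simulation sketch of these three representations (illustrative parameter values; not an InfoStat routine):

    import numpy as np

    def simulate_arma11(T, phi=0.0, theta=0.0, sigma=1.0, seed=0):
        """Simulate y_t = phi*y_{t-1} + e_t + theta*e_{t-1}.
        phi=0 gives an MA(1); theta=0 gives an AR(1)."""
        rng = np.random.default_rng(seed)
        e = rng.normal(0.0, sigma, T + 1)
        y = np.zeros(T + 1)
        for t in range(1, T + 1):
            y[t] = phi * y[t - 1] + e[t] + theta * e[t - 1]
        return y[1:]

    ar1 = simulate_arma11(100, phi=0.75)                  # AR(1)
    ma1 = simulate_arma11(100, theta=0.5)                 # MA(1)
    arma11 = simulate_arma11(100, phi=0.75, theta=0.5)    # ARMA(1,1)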

A stationary series is basically one whose mean, variance and autocorrelation function are constant over time. Longer series should be used when the assumption of stationarity of the process cannot be sustained. Non-stationary series can be transformed into stationary series by simple functions. InfoStat allows differencing the series prior to the construction of the model.
The Box-Jenkins methodology is based on the following steps:
Based on the autocorrelation and partial autocorrelation functions, choose p and q (a step known as model identification).
Estimate the parameters (using ordinary least squares, maximum likelihood, etc.).
Perform a diagnostic check on the model residuals (looking for lack of serial correlation, normality, homogeneity of variance and stationarity).
If the diagnosis suggests a good fit of the proposed model, predictions can be made. Otherwise, repeat the estimation and diagnosis with different values of p and q.
As tools for identifying ARMA models, InfoStat provides the possibility of obtaining:
Plots of the observed series vs. time. From these, you should decide whether it is necessary to difference the series and/or remove trends. If differencing is applied, the model will be estimated on the differenced series; the differenced series should be stationary.
Example 14: The following graph shows a series of 40 observations of a non-stationary process (a random walk) and the differenced series. The non-stationary series was simulated with InfoStat using the model y_t = y_{t-1} + u_t, with u_t = 0.3 u_{t-1} + 0.5 \varepsilon_{t-1} + \varepsilon_t and \varepsilon_t ~ iid N(0, \sigma^2). Differencing produces a stationary series.
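A sketch of a simulation along the lines of this example (the innovation variance and the seed are illustrative assumptions):

    import numpy as np

    def random_walk_arma_innovations(T=40, sigma2=1.0, seed=1):
        """y_t = y_{t-1} + u_t with u_t = 0.3*u_{t-1} + 0.5*e_{t-1} + e_t, e_t ~ N(0, sigma2)."""
        rng = np.random.default_rng(seed)
        e = rng.normal(0.0, np.sqrt(sigma2), T + 1)
        u = np.zeros(T + 1)
        y = np.zeros(T + 1)
        for t in range(1, T + 1):
            u[t] = 0.3 * u[t - 1] + 0.5 * e[t - 1] + e[t]
            y[t] = y[t - 1] + u[t]
        return y[1:]

    y = random_walk_arma_innovations()
    dy = np.diff(y)    # differenced series: stationary, with T-1 values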


Figure 38: Graphical representation of a random walk (non-stationary process) and its differenced series (stationary process).
Example 15: If the process that generates the data shows a trend, the following model could be suggested:

y_t = \beta_0 + \beta_1 t + \frac{\theta(L)}{\phi(L)}\,u_t.

The following graph shows a series generated in InfoStat from the model y_t = 5 + 1.5\,t + u_t, with u_t = 0.3 u_{t-1} + 0.5 \varepsilon_{t-1} + \varepsilon_t and \varepsilon_t ~ iid N(0, 3), together with the pure deterministic trend:

Figure 39: Series generated from the model of Example 15.

In cases like this one, the "trend" should be removed by first estimating the parameters of the regression y_t = \beta_0 + \beta_1 t + \varepsilon_t and then assuming that

y_t - \hat{\beta}_0 - \hat{\beta}_1 t \sim ARMA(p, q).
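A sketch of this two-step detrending, assuming y is a NumPy array; the residual series returned would then be identified and modeled as ARMA(p, q):

    import numpy as np

    def detrend_linear(y):
        """Fit y_t = b0 + b1*t + e_t by OLS and return the detrended series."""
        y = np.asarray(y, dtype=float)
        t = np.arange(1, len(y) + 1, dtype=float)
        X = np.column_stack([np.ones_like(t), t])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)   # b = (b0_hat, b1_hat)
        return y - X @ b                            # series to be modeled as ARMA(p, q)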
Graphical representations of the autocorrelation and partial autocorrelation functions: through the Identification submenu (autocorrelations: FAC and FACP), the autocorrelation function (ACF) and the partial autocorrelation function (PACF) can be obtained automatically for each univariate series whose model is to be identified.
The autocorrelation function is a measure of how correlated the observations within a time series are. From these functions, the significant statistical relationships can be summarized by selecting one of the models of the ARIMA family, since each model has an associated pair of autocorrelation and partial autocorrelation functions.
If \gamma_j is the j-th autocovariance and \gamma_0 is the variance, then \rho_j = \gamma_j / \gamma_0 is the j-th autocorrelation, which can be estimated as:

\hat{\rho}_j = \frac{\hat{\gamma}_j}{\hat{\gamma}_0}, \qquad \hat{\gamma}_j = \frac{1}{T-j}\sum_{t=j+1}^{T}(y_t - \bar{y})(y_{t-j} - \bar{y}).
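A sketch of this autocorrelation estimator (an illustrative helper, not part of InfoStat):

    import numpy as np

    def sample_acf(y, nlags):
        """rho_hat_j = gamma_hat_j / gamma_hat_0, with
        gamma_hat_j = (1/(T-j)) * sum_{t=j+1..T} (y_t - ybar)(y_{t-j} - ybar)."""
        y = np.asarray(y, dtype=float)
        T = len(y)
        d = y - y.mean()
        gamma = np.array([np.sum(d[j:] * d[:T - j]) / (T - j) for j in range(nlags + 1)])
        return gamma[1:] / gamma[0]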

The plot of the autocorrelation coefficients for different lags provides a representation of the autocorrelation function. The behavior of the autocorrelation function can be described as follows:
If the process is pure AR, the autocorrelations decline exponentially.
If the process is pure MA, the autocorrelations fall to zero rapidly, after a number of lags that depends on the order q of the process.
If the process is an ARMA(p, q), the autocorrelations decline rapidly.
Example 16: We present the ACF of a simulated AR(1) process. The series was generated using InfoStat's stochastic generator with an autoregressive coefficient equal to 0.75. The bars represent the autocorrelation coefficients, and the dashed lines above and below the reference line mark the limits of the 95% confidence interval for the population autocorrelation coefficient.

Figure 40: Autocorrelation function.


To conceptualize the partial autocorrelation function (PACF), suppose that y_t is covariance stationary and that a regression of y_t on y_{t-1} is fitted, that is, y_t = \phi_{11} y_{t-1} + \varepsilon_t. Then \phi_{11} is the first partial autocorrelation (which can be estimated by OLS). When the regression of y_t on y_{t-1} and y_{t-2} is performed, y_t = \phi_{21} y_{t-1} + \phi_{22} y_{t-2} + \varepsilon_t, the coefficient \phi_{22} is the second partial autocorrelation, and so on.

The plot of the partial autocorrelation coefficients for different lags provides a representation of the partial autocorrelation function. The behavior of the partial autocorrelation function can be described as follows:
If the process is pure AR, the partial autocorrelations fall to zero rapidly, after a number of lags that depends on the order p of the process.
If the process is pure MA, the partial autocorrelations decline to zero smoothly.
If the process is an ARMA(p, q), the partial autocorrelations decline rapidly.
Example 17: We present the PACF of an AR(1) process with autoregressive coefficient equal to 0.75, generated using the InfoStat simulator. The bars represent the sample partial autocorrelation coefficients, and the upper and lower lines mark the boundaries of the 95% confidence interval for the population partial autocorrelation coefficient.

Figure 41: Partial autocorrelation function.


In practice, estimating T/4 autocorrelations and partial autocorrelations by OLS can be computationally expensive. InfoStat obtains these estimates with the recursive algorithm of Durbin (1960), based on solving the Yule-Walker system of equations:

\hat{\phi}_{11} = \hat{\rho}_1

\hat{\phi}_{kk} = \frac{\hat{\rho}_k - \sum_{j=1}^{k-1}\hat{\phi}_{k-1,j}\,\hat{\rho}_{k-j}}{1 - \sum_{j=1}^{k-1}\hat{\phi}_{k-1,j}\,\hat{\rho}_j} \qquad (k = 2, 3, \ldots)

where \hat{\phi}_{k,j} = \hat{\phi}_{k-1,j} - \hat{\phi}_{kk}\,\hat{\phi}_{k-1,k-j} \; (k = 3, 4, \ldots;\; j = 1, 2, \ldots, k-1).
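A sketch of Durbin's recursion, taking the sample autocorrelations (for example, those returned by the estimator sketched above) as input; the function name is illustrative:

    import numpy as np

    def pacf_durbin(rho):
        """Durbin (1960) recursion: returns phi_hat_{kk} for k = 1, ..., len(rho).
        rho[0] holds rho_hat_1, rho[1] holds rho_hat_2, and so on."""
        K = len(rho)
        phi = np.zeros((K + 1, K + 1))          # phi[k, j] stores phi_hat_{k,j}
        pacf = np.zeros(K + 1)
        phi[1, 1] = pacf[1] = rho[0]
        for k in range(2, K + 1):
            num = rho[k - 1] - np.sum(phi[k - 1, 1:k] * rho[k - 2::-1])
            den = 1.0 - np.sum(phi[k - 1, 1:k] * rho[0:k - 1])
            phi[k, k] = pacf[k] = num / den
            for j in range(1, k):
                phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        return pacf[1:]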
Information criteria: In addition to the identification tools described above, InfoStat automatically calculates statistical information criteria (SIC) during the estimation process; these can be used both for identification and for diagnosis.
The idea behind these criteria is to find p and q that minimize the residual sum of squares plus penalty terms depending on p and q, so as to reduce the risk of overfitting or of fitting a poorly parsimonious model. The available criteria differ in the way they penalize over-parameterization.
The methodology based on the SIC as an identification tool (to choose p and q) suggests:
(1) Choose P and Q as the maximum possible values for p and q (information that can be obtained jointly from the ACF and PACF). In practice it is convenient to take p+q<4.
(2) For each ARMA(p, q) model such that p \le P and q \le Q, estimate the parameters and obtain the residuals \hat{u}_t(p, q).
(3) Calculate \hat{\sigma}^2(p, q) = T^{-1}\sum_{t=1}^{T}\hat{u}_t^2(p, q).
(4) Calculate the chosen SIC.
(5) Choose the p and q that minimize the chosen SIC.

Information criteria implemented in InfoStat are:

Akaike:  AIC = \ln\hat{\sigma}^2(p, q) + \frac{2(p+q)}{T}.

Schwarz:  BIC = \ln\hat{\sigma}^2(p, q) + \frac{(p+q)\ln T}{T}.

Hannan-Quinn:  HQ = \ln\hat{\sigma}^2(p, q) + \frac{c(p+q)\ln(\ln T)}{T}, \quad c > 1.
Note:
(1) For large T, AIC will choose the "right" p and q, but it can over-parameterize when T is not so large: if the model is an AR(2), for example, it will not choose p=0 or p=1, but it may choose an order larger than 2.
(2) In finite samples, BIC tends to underestimate p and q, since the penalty term tends to dominate.
(3) HQ solves (or tries to solve) the finite-sample problem of BIC.
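A sketch that evaluates the three criteria for given orders, where sigma2_hat would come from step (3) above (illustrative helper; the choice c=2 for Hannan-Quinn is an assumption):

    import numpy as np

    def information_criteria(sigma2_hat, p, q, T, c=2.0):
        """AIC, BIC and Hannan-Quinn as defined above (c > 1 for HQ)."""
        base = np.log(sigma2_hat)
        aic = base + 2.0 * (p + q) / T
        bic = base + (p + q) * np.log(T) / T
        hq = base + c * (p + q) * np.log(np.log(T)) / T
        return aic, bic, hq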

The Estimation, validation and forecasting submenu implements the second and third steps of the Box-Jenkins methodology once the model has been identified. When this menu is invoked, a window with two tabs appears: Model and Estimation, validation, forecasting, missing data, other tools. The Model tab allows you to specify the model equation to be estimated. You can select stationary ARMA models and non-stationary ARIMA models (in which case InfoStat requires you to enter the value of d, the differencing parameter used). The models to estimate can be non-seasonal (without trend) or seasonal (in this case InfoStat will ask you to enter the frequency that describes the seasonality; InfoStat also supports the presence of minor cycles within a larger seasonal cycle).
According to the specifications made by the user, the model equation to be estimated is shown, and the details of the model terms can be read on the screen. In addition, as many fields as parameters to be estimated are activated, so that the user can enter the initial values to be considered in the iterative estimation process. As a first step, it is suggested to activate the field related to the mu parameter (the expectation of the series), in which case InfoStat takes as a starting value for this parameter the rounded value of the arithmetic mean of the series.

InfoStat estimates the model parameters by maximum likelihood, conditional on the first p observed values of the series (Hamilton, 1994). In the Estimation, validation, forecasting, missing data, other tools tab you can carry out the estimation procedure, indicate which tools will be used to diagnose the estimated model, and make predictions once the model has been validated.

The likelihood function is optimized numerically with one of these algorithms (Alg. Numeric Estimate), as indicated by the user: Nelder and Mead, Powell, Fletcher and Reeves, and Polak and Ribiere. The first two are based on downhill (derivative-free) search procedures and the last two on the gradient method (Press et al., 1986). It is recommended to first select the Nelder and Mead algorithm with the initial values suggested by InfoStat. Later, the Polak and Ribiere algorithm can be applied using as initial values those obtained in the previous fit.
The maximum number of iterations (Max.Iter) must be entered by the user (InfoStat uses 1000 by default), and in the output it should be checked that the algorithm converged naturally (i.e. that the number of iterations performed was smaller than Max.Iter). If convergence is not achieved (number of iterations equal to Max.Iter), the procedure should be re-run, preferably from another set of initial values and/or increasing the value of Max.Iter.
If the Iterations field is activated, the results may display the values of the objective function at every step of the selected numerical optimization. If the Sheet of Results field is activated, InfoStat estimates the model selected by the user in the Models tab and presents the results of this estimation in the corresponding output. If a name is entered in the ASCII File field, InfoStat saves a text file with the name entered by the user.
When estimating a model it is possible to save, in the active table, the residuals (the differences between the observed values and those predicted by the estimated model), the values predicted by the model and the coefficients of the impulse-response function.
The impulse-response function estimates the coefficients of the Wold representation (Hamilton, 1994); this name is given to the graphical representation of the psi_j's (coefficients of the impulse-response function) versus the index j. Its interpretation is important in applied studies in economics, as it describes the instantaneous and lagged effects of a single shock on the time series under study.


Conceptually, the Wold decomposition states that every stationary time series has an associated infinite MA representation. The Wold decomposition theorem (Hamilton, 1994) states that if E(y_t) = 0 and y_t is a covariance stationary process, then

y_t = \sum_{j=0}^{\infty}\psi_j\,\varepsilon_{t-j} + d_t

where \{\psi_j\} are constants and \varepsilon_t = y_t - E(y_t \mid \Omega_{t-1}), i.e. \varepsilon_t ~ MDS, and such that:

E(\varepsilon_t\varepsilon_s) = \sigma^2 if s = t and 0 if s \ne t;  E(\varepsilon_t\,y_{t-j}) = 0 for all t and j > 0;  and E(\varepsilon_t\,d_s) = 0 for all s and t.

As tools to validate the identified and estimated model, the user can perform different types of operations on the model residuals (Residual validation). If the fitted model is good, the residuals should form a sample from a normal distribution with zero mean and constant variance, and should not show any type of serial correlation. To analyze the residuals, InfoStat provides the MAD statistic (median of the absolute deviations of the residuals from their median), the Range (difference between the minimum and maximum residuals), the Kurtosis and Skewness coefficients (which should show values close to three and close to zero, respectively), the autocorrelation and partial autocorrelation functions of the residual series (which should not suggest any model) and the normality test proposed by Jarque and Bera (Jarque and Bera, 1987) (p-values lower than the significance level suggest rejecting the null hypothesis of normal distribution). The Jarque-Bera test statistic is:

JB = \frac{T}{6}\left[A^2 + \frac{(K-3)^2}{4}\right] \xrightarrow{d} \chi^2_2

where T represents the sample size, A is the skewness coefficient defined as A = \frac{1}{T}\sum e_i^3 / S^3, K is the kurtosis coefficient defined as K = \frac{1}{T}\sum e_i^4 / S^4, and S is the standard deviation.
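A sketch of the Jarque-Bera statistic computed from a residual vector e; the chi-square p-value uses SciPy (both the helper name and the use of scipy.stats are illustrative):

    import numpy as np
    from scipy.stats import chi2

    def jarque_bera(e):
        """JB = (T/6) * (A**2 + (K-3)**2 / 4), asymptotically chi-square with 2 df."""
        e = np.asarray(e, dtype=float)
        T = len(e)
        S = np.std(e)                        # standard deviation of the residuals
        A = np.mean(e**3) / S**3             # skewness coefficient
        K = np.mean(e**4) / S**4             # kurtosis coefficient
        jb = T / 6.0 * (A**2 + (K - 3.0)**2 / 4.0)
        return jb, chi2.sf(jb, df=2)         # statistic and asymptotic p-value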
To study the serial correlation of the residuals, the Durbin-Watson statistic is also calculated, whose expression is:

DW = \frac{\sum_{t=2}^{T}(e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}

where e_t are the residuals of the identified and estimated ARMA(p, q) model. It can be shown that 0 < DW < 4. A DW value near 0 indicates positive autocorrelation (\rho > 0); a value close to 4 indicates negative autocorrelation (\rho < 0). When T is large, DW \approx 2 - 2\hat{\rho}, which implies that when \hat{\rho} is close to zero (i.e. no autocorrelation) DW will be close to 2. The distribution of the DW statistic is non-standard and depends on the explanatory variables of the ARMA(p, q) model. InfoStat derives exact p-values for testing the hypothesis of absence of serial correlation based on the DW statistic.
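A sketch of the DW statistic for a residual vector (InfoStat's exact p-values are not reproduced here):

    import numpy as np

    def durbin_watson(e):
        """DW = sum_{t=2..T}(e_t - e_{t-1})**2 / sum_{t=1..T} e_t**2; values near 2
        indicate absence of first-order autocorrelation (0 < DW < 4)."""
        e = np.asarray(e, dtype=float)
        return np.sum(np.diff(e)**2) / np.sum(e**2)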
To test the joint hypothesis that all autocorrelation coefficients are zero, InfoStat also obtains the Box-Pierce and Ljung-Box statistics (Pindyck and Rubinfeld, 1999). These statistics have the advantage of not depending on the explanatory variables, as happens with the Durbin-Watson statistic, and can show substantially higher power than the DW statistic. The Box-Pierce statistic (asymptotically chi-square) is calculated as:

BP = T\sum_{j=1}^{m}\hat{\rho}_j^2 \sim \chi^2(m-p-q)

where m is the number of coefficients involved in the test. If the calculated statistic is greater than the 5% critical value, we can state (with a significance level of 0.05) that the true autocorrelation coefficients \rho_1, \ldots, \rho_m are not all equal to zero. The Ljung-Box statistic
(asymptotically chi-square) is calculated as:
LB = T(T+2)\sum_{j=1}^{m}\frac{1}{T-j}\,\hat{\rho}_j^2 \sim \chi^2(m-p-q)

In practice, InfoStat takes m = \min(T/2, 3\sqrt{T}). As T becomes large, the distribution of these statistics flattens, \chi^2_m \to U(0, \infty). To solve this problem, InfoStat also automatically obtains a standardized LB test, based on the following result:

\frac{LB-(m-p-q)}{\sqrt{2(m-p-q)}} \xrightarrow{d} N(0, 1)
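A sketch of the Box-Pierce, Ljung-Box and standardized Ljung-Box statistics, built from the sample autocorrelations of the residuals (rho_hat) and the orders p and q of the fitted model; the dictionary layout and the use of scipy.stats are illustrative:

    import numpy as np
    from scipy.stats import chi2, norm

    def portmanteau_tests(rho_hat, T, p, q):
        """Joint tests that the first m residual autocorrelations are zero."""
        rho_hat = np.asarray(rho_hat, dtype=float)
        m = len(rho_hat)
        j = np.arange(1, m + 1)
        bp = T * np.sum(rho_hat**2)
        lb = T * (T + 2) * np.sum(rho_hat**2 / (T - j))
        df = m - p - q
        z = (lb - df) / np.sqrt(2.0 * df)              # standardized LB, approx. N(0,1)
        return {"BP": (bp, chi2.sf(bp, df)),
                "LB": (lb, chi2.sf(lb, df)),
                "LB_std": (z, norm.sf(z))}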
The Missing data subwindow can be used to predict missing values or to detect and measure the influence of outliers. If the Predict field is not activated, InfoStat automatically calculates values for the empty cells of the series; if it is activated, the user can enter the number of cases to be predicted in the field to the right of Predict, and a grid becomes visible where the case numbers to be predicted should be entered. Both strategies, prediction of values in empty cells and in cells referenced by the user, can be applied simultaneously.
If the Sheet of Results field is activated, InfoStat will estimate the model selected by the user in the Models tab and will present the results of this estimation in the corresponding output (Output window). The Write grid option, when activated, adds to the active table a new column containing the complete original series with the missing values just predicted. Optionally, the user can specify the name of a text file in the ASCII File field of the Missing data subwindow, in which the predictions of the missing data will be saved.
InfoStat predicts missing values through a procedure based on the expectation of the observation conditional on the available information. This linear predictor is constructed by imposing the restrictions proposed by Alvarez et al. (1993). The procedure uses the initial values of the series up to the first missing datum and then predicts all the remaining values (the missing values plus the values available after the missing data). The prediction is made by imposing restrictions that minimize the prediction error for the information following the missing data (this restriction implies that the first known value after the missing data is predicted exactly). It can be shown that the resulting predictor is the best linear unbiased predictor, in the sense of having the smallest mean square error of prediction, if the model is known (Guerrero and Peña, 2000). However, it is important to note that the true model is rarely known. The behavior of these predictions when the model does not correspond to the one associated with the generating process has been analyzed by simulation (Smrekar, Robledo and Di Rienzo, 2001).
The forecasts that can be obtained in this InfoStat subwindow are based on extrapolation methods and involve projecting into the future the patterns and relationships observed in the past. Predictions are made from the estimated ARIMA model. The user must indicate which observations of the series should be used to produce the forecast in the Data to use, First and Last fields.
If, for example, you want to predict h=4 new values of the series using the model estimated from the T observations, you must enter the number 4 in the No. steps field and the numbers 1 and T in the First and Last fields. Long-term forecasts (a high number of steps) constructed from ARIMA models are less reliable than short-term forecasts.
The possibility of indicating the initial and final values of the data used for forecasting allows cross-validation of the fitted models in InfoStat. For example, if the value T-h is entered in the Last field, InfoStat will use a series of length T-h to estimate the model parameters and will automatically calculate the mean square errors of prediction (squared differences between the observed values and those predicted by the model) for the last h observations.
The re-estimation of the model for predictive purposes is called calibration. In InfoStat, calibration can be done by activating the Fixed, Recursive or Mobile field. In the fixed calibration, the h predicted values are built on the model estimated with the data indicated by the user in the First and Last fields. In the recursive calibration, the width of the window is updated (increased) every time a new value is predicted; the update is done by adding the newly available data to the series used for estimation. In the mobile calibration, the window keeps a fixed width, so that to predict observation T+h the data from h-1 to T+h-1 are used (see the index sketch below).
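A sketch of the data windows used by the three calibration schemes (only the index bookkeeping; the model re-estimation itself is omitted, and the exact window convention is an illustrative assumption):

    def calibration_windows(T, h, scheme="recursive"):
        """Yield (first, last) 1-based indices of the data used to forecast
        observations T+1, ..., T+h under each calibration scheme."""
        for k in range(1, h + 1):
            if scheme == "fixed":        # always the original window 1..T
                yield 1, T
            elif scheme == "recursive":  # window grows as new data become available
                yield 1, T + k - 1
            elif scheme == "mobile":     # window of fixed width T slides forward
                yield k, T + k - 1

    # list(calibration_windows(T=96, h=4, scheme="mobile")) -> [(1, 96), (2, 97), (3, 98), (4, 99)]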
For all predictions, InfoStat can build prediction bands with coefficient (1-alpha)100%, where alpha is entered by the user in the Bands at: field. The Write grid option, when activated in the Forecasts subwindow, adds to the active table a new column containing the complete original series together with the newly predicted values. Optionally, the user can specify the name of a text file in the ASCII File field of the Forecasts subwindow, in which the forecasts and the prediction errors (whenever InfoStat can calculate them) will be saved. If the Sheet of Results field is activated, InfoStat makes the requested forecasts and presents the results of these projections in the Output window. It is understood that the user accessing this subwindow intends to generate forecasts or projections that are optimal in some sense.
If the forecasts are recorded in the active table, they may be plotted from the graphics
panel.
A loss function describes how costly it is for the forecast to deviate by a certain magnitude or distance from the true value. As a loss function, InfoStat automatically reports the mean square error of prediction (MSEP), which represents the objective function to be minimized in the calibration process. Following Hamilton (1994), if Y*_{t+1|t} denotes the forecast based on a set of variables observed at date or time t, then the MSEP is:

MSEP = E\,(Y_{t+1} - Y^{*}_{t+1|t})^2.

It can be shown (see Hamilton, 1994) that the one-step forecast that minimizes the MSEP is \hat{y}_{t+1} = E(y_{t+1} \mid y_t, y_{t-1}, \ldots), and that the forecast k steps ahead is \hat{y}_{t+k} = E(y_{t+k} \mid y_t, y_{t-1}, \ldots).
For the construction of confidence intervals, InfoStat calculates the k-steps-ahead forecast error as:

e_{T+k} = y_{T+k} - \hat{y}_{T+k} = \psi_0\,\varepsilon_{T+k} + \psi_1\,\varepsilon_{T+k-1} + \cdots + \psi_{k-1}\,\varepsilon_{T+1}

with forecast error variance expressed as:

V(e_{T+k}) = E(y_{T+k} - \hat{y}_{T+k})^2 = (\psi_0^2 + \psi_1^2 + \cdots + \psi_{k-1}^2)\,\sigma^2 = \left(1 + \sum_{j=1}^{k-1}\psi_j^2\right)\sigma^2.

The estimate of \sigma^2 is based on the sum of squared residuals obtained after the parameters of the specified ARMA model have been estimated, i.e.

\hat{\sigma}^2 = \frac{\sum_{t=1}^{T}\hat{\varepsilon}_t^2}{T-p-q}

when the parameters are estimated with the exact likelihood function, or

\hat{\sigma}^2 = \frac{\sum_{t=p+1}^{T}\hat{\varepsilon}_t^2}{T-2p-q}

if the conditional likelihood function is used (the one implemented in InfoStat).

Based on this, knowing the form of the prediction error variance and assuming normality of the error terms, it follows that the (1-\alpha)100% confidence interval can be approximated by:

\hat{y}_{t+k} \pm z_{1-\alpha/2}\,\hat{\sigma}\left(1 + \sum_{j=1}^{k-1}\hat{\psi}_j^2\right)^{1/2}
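A sketch of this approximate interval, given the point forecast, the psi weights of the Wold representation (psi_1, psi_2, ...) and sigma_hat (all inputs illustrative):

    import numpy as np
    from scipy.stats import norm

    def forecast_interval(y_hat, psi, sigma_hat, k, alpha=0.05):
        """(1-alpha)100% interval for the k-steps-ahead forecast:
        y_hat +/- z_{1-alpha/2} * sigma_hat * sqrt(1 + sum_{j=1..k-1} psi_j**2)."""
        psi = np.asarray(psi, dtype=float)
        z = norm.ppf(1.0 - alpha / 2.0)
        half = z * sigma_hat * np.sqrt(1.0 + np.sum(psi[:k - 1]**2))
        return y_hat - half, y_hat + half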

Example 18: For a series of 100 observations simulated in InfoStat specifying an AR(2), we performed the estimation of an AR(2) model using the Nelder and Mead algorithm, indicating the values 1 and T-h=96 in the First and Last fields of the forecast subwindow (calibration set) and No. steps h=4. The results are presented in Table 67.
Table 67: Results of the estimation, validation and forecasting process for an AR(2) model fitted to an AR(2) series.
Estimación maximoverosímil ARIMA
Algoritmo numérico de optimización: Nelder&Mead

Información general
Serie      Nro.Obs.  Media  Varianza muestral  Desvío Estándar
Serie3(1)  96        -0.11  2.97               1.72

Información sobre pronósticos realizados
Obs     Pronóstico  LIP(95)  LSP(95)
97.00   -0.01       -1.96    1.94
98.00   -0.47       -2.49    1.55
99.00   -0.12       -2.52    2.28
100.00  -0.29       -2.79    2.20
ECMP: 0.86
------
Resultados de la Estimación
Parámetro  Estimación  Error Std  t-val  Valor p
Cte        0.02        0.10       0.16   0.8738
=> Mu_Y    0.12        0.77       0.16   0.8747
AR(1)      0.28        0.08       3.34   0.0012
AR(2)      0.59        0.08       7.08   <0.0001

Medidas Resumen y Validación
Estadístico            Valor Observado  Valor p
Verosimilitud          -132.35
CMResidual             0.99
R^2                    0.67
R^2 Corregido          0.67
Akaike IC:             0.05
Schwarz IC:            0.13
Hannan-Quinn IC:       0.08
-------------------
Iter.para Converger:   139
MAD                    0.68
Rango Residuos         4.81
Asimetría Residuos     -0.11
Kurtosis Residuos      2.68
Normal.(Jarque-Bera)   0.62             0.7347

Función autocorrelación: r(k) (se muestran sólo los 5 primeros lags)
Lag  Coef   se r(k)  t-val  Valor p  Signif
1    -0.01  0.10     -0.11  0.9110
2    0.01   0.10     0.12   0.9076
3    0.09   0.10     0.85   0.3964
4    0.13   0.10     1.29   0.2010
5    0.16   0.11     1.52   0.1327

Función autocorrelación parcial: phi(k) (se muestran sólo los 5 primeros lags)
Lag  Coef   se r(k)  t-val  Valor p  Signif
1    -0.01  0.10     -0.11  0.9110
2    0.01   0.10     0.12   0.9086
3    0.09   0.10     0.86   0.3947
4    0.14   0.10     1.33   0.1875
5    0.17   0.10     1.64   0.1073

In the Information about forecasts made output, you can see the block of general information about the series and the forecasts for observations 97 to 100, with their respective lower and upper prediction interval limits (LIP(95) and LSP(95)). The mean square error of prediction (ECMP) is 0.86.
The Estimation results table shows the parameter estimates with their standard errors, t statistics and p-values calculated under the assumption of normality. The Cte=0.02 term refers to the constant of the AR polynomial, which is related to the expectation of the data in the form Cte / (1 - \sum_{j=1}^{p} AR(j)); in the table this quantity is indicated by => Mu_Y and, in this example, its value is 0.12. This parameter is not statistically significant because the series originated from a process with zero expectation.
The AR coefficient estimates, 0.28 and 0.59, are statistically significant (the series under study was generated by an AR(2) with coefficients 0.3 and 0.5).
The Summary statistics and validation table shows statistics that indicate a good fit. The normality test indicates that there is no evidence to reject the assumption of normality of the residual series. Finally, the estimates of the autocorrelation and partial autocorrelation functions of the residuals suggest that there is no correlation among them (remember that they were generated independently).

Fitting and smoothing


The FITTING AND SMOOTHING submenu of the STATISTICS menu allows the use of tools to describe trends in a dependent variable (Y) as a function of one or more explanatory variables (X). The implemented smoothing techniques do not require the specification of a model; they are useful for filtering out variation in the scatter plot of a dependent variable that hinders the visualization of the trend of this variable over X.
InfoStat provides linear smoothers, i.e., in the smoothed series each element is a linear combination of elements of the original series. Although the smoothed series provides a biased estimator of the trend, it may represent a substantial gain for interpretation due to its lower variance. To choose among several linear smoothers, the mean square errors (squared bias plus variance) are usually compared; the smaller this value, the better the smoothing strategy.
The different smoothing techniques are based on the choice of a function to be applied in the neighborhood of each observation. The neighborhood is defined as the set of observations (Xi, Yi) that are above and below (or before and after) each value of the regressor variable (except Xmax and Xmin).
When using smoothers it is recommended to plot simultaneously the smoothed series and the original observations as a function of the regressor variable, or of an ordered sequence, to confirm that the smoothing technique is not providing a false signal. InfoStat can automatically obtain these plots from the FITTING AND SMOOTHING menu.
Some fitting techniques, such as the polynomial fit and the seasonal adjustment, can be interpreted as a special form of smoothing. InfoStat fits high-order polynomials to remove high-frequency irregular fluctuations.
In the smoothing window, the dependent variable(s) are chosen (the variables whose trend you want to visualize by smoothing) and, if a variable is to be placed on the X axis, it should be chosen as Regressor/ordering variable (optional). If this variable is not specified, the Y data are taken in the order in which they were entered in the table (from case 1 to case T, where T is the number of cases). After clicking OK, a second window allows selecting the smoothing technique and, if required, deciding on the window width, a value that determines the number of observations considered as belonging to the neighborhood of each observation; for this, the user should use the Window width field. If smoothing is applied repeatedly to a single variable, a new column is generated in the data table for each smoothing. To avoid the proliferation of columns, the Overwrite field should be activated. You can also request Plot to graph the smoothed series, Include original series to show the original and smoothed data in the same graph, and Partitions in the same plot to represent two or more smoothed series together.

Smoothing techniques
Moving average and non-central moving average: this type of smoother is based on the use of the Y values corresponding to a moving window of points in X to estimate trends in Y. For a non-central moving average of window width n, the t-th value of the smoothed series is the arithmetic mean of the observations Yt, Yt-1, Yt-2, ..., Yt-n+1 of the original series; that is, the non-central smoother only uses as neighbors the values preceding the value being transformed. If the series is a time series (a sequence ordered in time), one could say that only past values are used. InfoStat also provides a moving average in which the neighborhood of each datum consists of its nearest neighbors, not necessarily the preceding ones. In a moving average with window width 5, each value Yt is replaced by the mean of the 5 observations closest to Yt. If the variable X takes equidistant values, the moving average with window 5 for each observation Yt (except at the ends of the series) is calculated from the observations Yt-2, Yt-1, Yt, Yt+1, Yt+2.
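A sketch of the two moving-average variants for a window of width n (illustrative helpers; the windows are simply truncated at the ends of the series):

    import numpy as np

    def moving_average_noncentral(y, n):
        """Smoothed value at t: mean of Y_t, Y_{t-1}, ..., Y_{t-n+1} (past values only)."""
        y = np.asarray(y, dtype=float)
        return np.array([y[max(0, t - n + 1):t + 1].mean() for t in range(len(y))])

    def moving_average_central(y, n):
        """Smoothed value at t: mean of the n observations closest to Y_t (n odd)."""
        y = np.asarray(y, dtype=float)
        half = n // 2
        return np.array([y[max(0, t - half):t + half + 1].mean() for t in range(len(y))])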
Moving median and non-central moving median: the moving median smoother is obtained by analogy with the moving average (see Moving average), but the function applied to obtain the smoothed value is the median of the neighborhood values rather than the mean.
Exponential smoothing: a moving average (see non-central moving average) with exponentially decaying weights, so that higher weights are assigned to the Y values closest to Yt. The t-th element of the exponentially smoothed series is:

\hat{Y}_t = a\,Y_t + a(1-a)\,Y_{t-1} + a(1-a)^2\,Y_{t-2} + \ldots

where the sum extends backwards over the whole series. The recursion used by InfoStat to obtain \hat{Y}_t is:

\hat{Y}_t = a\,Y_t + (1-a)\,\hat{Y}_{t-1}

Small values of a produce a greater degree of smoothing.
Double exponential smoothing: the data smoothed by exponential smoothing (see Exponential smoothing) are smoothed again with an exponential filter. This allows a smoother result without giving too much weight to individual past observations. The t-th element of the doubly exponentially smoothed series is:

\hat{\hat{Y}}_t = a\,\hat{Y}_t + (1-a)\,\hat{\hat{Y}}_{t-1}

Holt-Winters smoothing: obtained by applying the simple exponential smoothing function with the addition of the mean change in trend (increase or decrease). The t-th element of the smoothed series is obtained from the following system of equations:

\hat{Y}_t = a\,Y_t + (1-a)(\hat{Y}_{t-1} + r_{t-1})
r_t = b(\hat{Y}_t - \hat{Y}_{t-1}) + (1-b)\,r_{t-1}

where a and b are weights (smoothing parameters) that take values in the interval [0,1] and r_t is a smoothed series representing the average rate of change of the series Yt.
Note: For specific applications of these smoothers to time series forecasting, see Time series.
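A sketch of the exponential and Holt-Winters recursions given above (a and b are the smoothing weights; the initial values chosen here are a simple illustrative convention):

    import numpy as np

    def exponential_smoothing(y, a):
        """S_t = a*Y_t + (1-a)*S_{t-1}; smaller a gives stronger smoothing."""
        y = np.asarray(y, dtype=float)
        s = np.empty_like(y)
        s[0] = y[0]
        for t in range(1, len(y)):
            s[t] = a * y[t] + (1 - a) * s[t - 1]
        return s

    def holt_winters(y, a, b):
        """Level: S_t = a*Y_t + (1-a)*(S_{t-1} + r_{t-1});
        trend: r_t = b*(S_t - S_{t-1}) + (1-b)*r_{t-1}."""
        y = np.asarray(y, dtype=float)
        s, r = np.empty_like(y), np.empty_like(y)
        s[0], r[0] = y[0], 0.0
        for t in range(1, len(y)):
            s[t] = a * y[t] + (1 - a) * (s[t - 1] + r[t - 1])
            r[t] = b * (s[t] - s[t - 1]) + (1 - b) * r[t - 1]
        return s, r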


Graphs

Graphics
The InfoStat graphics component interacts with the user through two windows: a Graphic tools window and another, called Graphs, that shows the graphs themselves. These windows are updated to reflect the characteristics of the graph being viewed.
An InfoStat graph is a collection of graphic series, a coordinate system, a title, a collection of text items and a legend. A graph has a default size that the user can modify by changing the size of the graphics window. On the active graph, options can be selected from the menu associated with the right mouse button.
Each graphic series is a set of graphical elements of a single type (points, pie segments, rectangles, lines, etc.). Graphic series have attributes that the user can modify, such as name, color, shape of the elements and the order in which the series are plotted (if the graph has more than one). There is a menu of options for a graphic series, associated with the right mouse button, which applies to the series selected in the Series tab of the Graphic tools window.
The graphic elements of a series have a body (which is what changes color and shape), a border that surrounds the body, arms (upper and lower) and arm terminations.
The coordinate system has one X axis and one or more Y axes (possibly one for each series); each axis has its legend, scale and type, which can be numerical or categorical. The most important properties of an axis with a numerical scale that the user can modify are the minimum and maximum, the number of decimal places, the number of divisions of the numerical scale, the thickness and the color. For a categorical X axis, the user can also change the sequence of categories and their names. When there is a large number of categories, they can be presented in a sequence that alternates their position on the axis for better readability, or only a subset of them can be shown. The axes may include cut lines whose thickness and color can be changed.
The title, the axis legends and the texts displayed on the graph, whether added by the user or shown automatically by InfoStat, are objects that can be moved, edited and changed aesthetically. The options for this are found in the Graphic tools window, in the corresponding tabs (Series, X axis, Y axis, Tools).
The legend is an object linked to the series; the number of entries in a legend corresponds to the number of series whose Legend visible attribute is active. The name, color and shape of the graphic element and the order in which the items are presented in the legend correspond to the name and order of the series. The position of a series can be modified, in which case the position of its legend entry is modified as well. To access the legend options, select the legend with the mouse and press the right button.
Below are the tools available to modify and adjust a graph. The presentation begins with a description of the Graphic tools window and then of the Graphs window.


Graphic Tools
This window always appears together with the Graphs window. If it is closed, it is activated again by pressing the left mouse button when the cursor is positioned over a graph. It has four tabs: Series, X axis, Y axis and Tools.


Series Tab
This tab activates a panel organized into three parts. The top section contains a list of the series included in the graph. The middle section presents options applicable to the series selected in the list above; its content depends on the type of the selected series. The lower section contains an edit field for the graph title and some buttons that activate options on it (changing the font, returning it to its default position and making it visible or invisible).
To modify the properties of a graphic series, such as its color, the series must be selected. A series is selected by clicking the left mouse button with the cursor pointing to the name of the series in the list of series of the graph. Several series can be selected simultaneously by dragging the mouse over them with the left button pressed. When one or more series are selected, they are highlighted with a darker background color. A selected series can be moved, to alter the plotting order, by holding down the <Ctrl> key and pressing the up or down arrow keys. The change in the position of a series is reflected in the order of the legend entries.
With one or more series selected and the cursor on any of them, pressing the right mouse button enables a menu of series properties that can be modified. The items presented in this menu depend on the type of the selected graphic series and, in turn, appear enabled or not depending on the particular characteristics of the series.
Edit opens a dialog box to change the name of the series. This window is also activated by double-clicking on the name of the series or by pressing <Enter> when the series is selected. When the name of a series is changed, its entry in the legend of the figure changes as well. This is the way to edit the contents of a legend.
The Color item displays a submenu with a list of available colors. By selecting one of them, the body of the graphic elements of the series changes color, and so does the name with which the series appears in the list.


The Smooth item is enabled, for example, for scatter plots or dot plots. When it is activated, InfoStat generates a smoothed series that is added to the graph and also appears in the Series window. This series has its own options to modify it (see Fitting and smoothing).
The Draw contours item is enabled when the graphic series are scatter diagrams or dot plots, among others. When it is activated, InfoStat generates a contour series that is added to the graph and, like the smoothed series, is also added to the Series window. This series has its own options to modify it, just as for the smoothed series.
The Visible/Invisible item makes the selected series visible or invisible. If a series is invisible, its name appears in the Series window in a lighter shade.
New Y axis adds a new Y axis for each of the selected series. The new axes appear alternately to the right and left of the graph and receive a serial number starting at 1 (one). The original axis of the graph is axis 0 (zero). The position of a Y axis can be modified (as explained later). The "additional" Y axes can be deleted; in that case, the reference axis for the series whose scale was shown on the deleted axis becomes axis 0 (zero) again.
The Order X axis item can sort the values of the X variable, if it is categorical, in alphabetical order, by increasing values of Y or by decreasing values of Y.
The Symbol item allows changing the body shape of the graphic elements that make up the series. The available shapes are listed in a submenu of this item. The shape of the graphic elements cannot be changed for all types of graphic series; in those cases the item is disabled.
The Body item offers the following options: Fill/Empty and With/Without border. For example, if the body of the points is solid, selecting the first option leaves only their outlines; if the option is activated again, they are filled again. The same applies to the border of the symbol when the With/Without border option is invoked.
The Errors item contains options to make visible or invisible the errors associated with the measure represented by a graphic element. Errors, whether standard errors, standard deviations, confidence intervals or prediction intervals, are represented symbolically by a segment whose length is related to their magnitude. This item is not enabled for all types of graphic series.
The Identifiers item activates and deactivates the display of labels that identify the elements of the graphic series. These labels can contain the coordinate of the graphic element on the Y axis, the coordinate on the X axis, both coordinates (X, Y), the case number that the element represents (if applicable) or an arbitrary text. Some graphical techniques, such as scatter plots, allow associating a column of the data table with the labels to be shown.
Under the Identifiers item there are also options to change the appearance of the labels. Any visible label in a graph can be moved


or edited within the graph. When either of these two things happens, the label is "touched". Under this condition the label remains anchored to the place to which it was moved or where it was edited. Then, if the graph is resized or the scales of the axes are changed, these "modified labels" move slightly differently from the rest.
To return a label to its default position and content, it must be selected inside the graph by pressing the left mouse button while pointing at it with the cursor, and then the Reset item must be activated at the end of the pull-down menu opened with the right mouse button.
The Connectors item is used to make visible or invisible the lines connecting the bodies of the graphic elements of a series (connectors). Under this menu there are options to change the color, thickness, fill (pattern) and flat display of the connectors.
The Enclosures item makes visible or invisible a special kind of connector that joins the ends of the error bars, allowing confidence or prediction bands to be built. Like the Connectors item, this menu has options to improve the appearance of the connectors, such as color and thickness.
The Remove graphic series item eliminates one or more previously selected series.
The Check to copy item is used to copy a graphic series and superimpose it on another graph in order to observe the behavior of two variables together. An example is given below.
Example 19: The Base1 data file corresponds to the Permanent Household Survey carried out by INDEC in the first quarter of 2006. Using the Base1 file, the following graph was obtained:

Figure 42: Bar chart for the number of household members younger than 10 years (IX_MEN10) and dot plot for total family income (ITF) versus the number of household members younger than 10 years (IX_MEN10). File Base1.idb.


This figure was obtained by making a bar chart with "IX_Men10" as classification criterion and "Num_Home" as the variable to plot; after accepting, Relative frequency was selected in the drop-down menu that shows "Mean" by default. This was followed by a dot plot of "ITF" versus "IX_Men10". On this last graph, the series was selected (Series tab) and Check to copy was chosen with the right button; finally, on the bar chart, the right mouse button was pressed and Add series marked to move was chosen. An additional Y axis was also added for the copied series (ITF) by selecting the series and activating New Y axis with the right mouse button. A legend can be added to the graph (press the right button on the graph and select Show legend) to associate the symbols and colors with the analyzed variables; in addition, pressing the right mouse button on the legend, Border, Invisible was chosen in the menu that appears. Finally, on the ITF series, smoothing was activated and then, for the newly generated series in the same Series tab, Poli was selected with 1 in the adjacent box, followed by an update (Refresh button). The smoothed series was removed from the legend (select the smoothed series and toggle Legend visible/invisible).

X Axis Tab
This tab presents a panel consisting of three sections. The first controls the attributes of the scale. The second controls the thickness and color of the axis. The third displays the axis legend and allows editing it, changing its typographical attributes and eventually restoring its default location.
The scale attributes section contains edit fields for the axis minimum and maximum and a list where, optionally, you can specify the X-axis points at which to draw cut lines. InfoStat draws lines perpendicular to the X axis at the indicated points. The number and position of these lines, as well as their color and thickness, are controlled by the user.
The number of scale divisions (ticks) and the number of decimal places shown by the scale can be changed. It should be noted that, when identifiers containing the X-axis coordinate are included, the number of decimal places with which this value is displayed corresponds to the number of decimal places specified for the axis scale.
If the X axis scale is numeric and the minimum is greater than 0, an additional edit field labeled Power appears. By default this edit field contains a 1. If a value (lambda) is entered in this field, InfoStat will plot on the X^lambda scale. When lambda = 0, InfoStat uses a logarithmic scale for the axis values.
The command button bar, at the base of the scale attributes section, allows, from left to right: reversing the scale (lowest to highest or vice versa), specifying whether the labels associated with the scale divisions are displayed in one or in two alternating lines for easier reading, changing their font, returning all the scale labels to their original positions and displaying the scale in decimal or exponential notation.
When the X axis represents a categorical variable, the controls for the number of ticks and the number of decimal places disappear. Also, the edit fields for the minimum and


maximum are replaced by a list of the categories represented on the axis. In this list (which appears in the X axis tab subwindow), the names of the categories can be edited (double click or <Enter>) and their position altered (Ctrl + up or down arrow key). If the position of a category is changed, this is reflected in the graph.
If the number of categories is large, as can happen when time series are plotted, it may be impossible to read all the labels of the scale divisions due to overlapping. To avoid this inconvenience, the labels can be arranged in a double row. If this option is not sufficient to resolve the overlap, there is a control in the command button bar (shown only when the X variable is categorical) that allows specifying the number of categories to be counted from the first visible category to find the next visible one (the default value is one, in which case all categories are visible).

Y Axis Tab
This tab is similar to the one described for the X axis. However, since as many Y axes as series can be selected in a graph, this tab is updated according to the selected Y axis. An axis is selected by pressing the left mouse button when the cursor points to the desired axis. The number of the "active axis" is shown at the top right of the frame that delimits the scale attributes section.
In addition to the maximum and minimum edit fields, a field called Base is added where you can enter a reference value for bar graphs: bars are projected above this value if the difference between the mean and the base is positive, and below it if the difference is negative (see the example in the Bar chart section). The Base edit field is protected and changes will not take effect while the protection is active. To remove the protection, clear the checkbox to the left of the field by clicking with the cursor positioned on it, and then enter the value for the base.
Cut lines on the Y axis are equivalent to those on the X axis, but each Y axis can have its own cut lines. The button bar at the bottom of the scale attributes section is similar to the one described for the X axis tab, but it includes a command button with a trash container icon which, when activated, removes the selected axis. This action applies only when there are at least two axes, and axis 0 cannot be removed (there must always be a reference axis, although it may be invisible). Additional buttons allow repositioning the selected axis to the left or right of the graph (Invert axis position button) and moving it a few points to separate it from graphic elements whose X coordinates would otherwise make their bodies overlap with the Y axis (left and right arrow buttons).
Like the X axes, the Y axes have a Power edit field that is applied exactly as described for the X axis.

Graphs Window
In this window, InfoStat stores all the graphs made in a work session. The graphs are numbered consecutively from zero. Clicking on the numbered tabs displayed at the bottom of the window brings up the required graph. You can also move through the graphs created in this window with the quick commands CTRL+Left Arrow and CTRL+Right Arrow; likewise, CTRL+End and CTRL+Home go to the last and first graph created, respectively. All the actions that can be performed in this window are linked to
a menu that is displayed with the right mouse button when the cursor is positioned over the
graphics window. The actions are: Copy, which copies the graph to the clipboard so that it can be pasted, for example, into a word processor; Copy BMP, which copies the graph in bitmap format (this format is universally recognized by any word processor and is intended to avoid incompatibility problems with the Enhanced Windows Metafile format that InfoStat uses by default when copying a graph to the clipboard); Save as JPG and Save as EMF, which open a dialog box that lets you save the active graph in these formats (by default, InfoStat saves the graph with the name of the file from which it was generated followed by the corresponding number of the graph tab preceded by an underscore); Save graphic, which opens a dialog box that lets you save all the graphs constructed in InfoStat format (it generates a file with the .IGB extension, recognized by InfoStat); Open graphic, which allows opening graphs saved as IGB in the InfoStat Graphs window; Background color, which changes the background color (white by default); Show Legend, which shows or hides the legend; Grid, which activates and deactivates a grid corresponding to the subdivisions of the X and Y scales (a submenu provides appearance settings); Print, which opens a dialog box to print the active graph; Delete, which deletes the active graph; and Delete..., which provides options for deleting forwards or backwards, or deleting all the graphs created. In addition, there are several menu items related to the copying or subscription of formats, which are detailed below. The Animate option displays alternately all the graphs created so far and can be turned on and off with the quick command CTRL+A. The Panel option creates a copy of the active graph to display simultaneously all the graphs built so far. These graphs are presented in a panel attached to the viewing window, so any change made to them in this window will be reflected in their copies in the panel. Add Text allows adding a line of text, with the same characteristics as explained for the Tools tab, Text button. Add series marked to move adds the series that were marked for copying in the Series tab.
The Student Version of InfoStat does not have access to the options for saving graphs. In addition, graphs generated with the student version appear with a watermark that refers to that release.

Subscription and copies of graphic formats


InfoStat classifies its graphs as format Editors, format Subscribers or free format. An Editor is a graph whose format is used as a model from which to copy attributes. The attributes, and the aspects of those attributes, that you wish to transfer are specified using the Copy and subscription preferences submenu, which is accessed by pressing the right mouse button on the Editor graph. This submenu activates a dialog box, Graphics: subscription preferences, where all the options are found (Options tab). The option to copy the attributes of the series by name or by order determines whether characteristics are matched to series according to the order in which they are plotted or according to their names.
Subscriber graphs "point" to the Editor from which they copy the format. This link is dynamic, so each time the Editor is changed, its characteristics are reflected in the Subscriber. For a graph to become a Subscriber, you must first activate it by clicking on the corresponding numbered tab at the bottom of the window. Then you must select the Editor graph from the drop-down menu Subscribe to graph format..., activated with the right mouse button. All graphs can be Editors except those that are already Subscribers. There is a simple alternative to declare multiple Subscribers of the same Editor simultaneously: the Editor must be active, the menu must be displayed by pressing the right mouse button, and then Copy and subscription preferences must be chosen; on the Subscribers tab you can select the Subscribers of the chosen Editor.
A graph can stop being a Subscriber at any time; in that case its attributes are disconnected from the Editor graph. This is accomplished with the Unsubscribe format command of the right mouse button menu. For the subscription mechanism to work properly with a categorical X axis, changes in the names of the categories and in their location on the Editor's axis must be made after the subscriptions. You can also copy the attributes of a graph, without subscribing, simply by using the Copy graphic format... submenu.
Note: If the X axis represents a categorical variable and you intend to change the position of the different categories on the Editor graph so that these changes are reflected in the Subscriber graphs, then activate the "copy minimum and maximum" cells for the X axis.

Legends
The legend of a graph is an object with its own options menu, which is activated by touching the legend with the mouse and pressing the right button. The legend options are: Position (Right, Left, Up, Down and Free); Contour (Visible, Invisible), which alternates between presence and absence of a frame; Number of columns, which modifies the layout of the legend entries; Font (opens a dialog to change the typography); Color (a background color for the legend can be chosen); Transparent background, which leaves the legend without background color; and Hide, to avoid showing the legend. To change the legend titles, their order or their icons, you must edit those features of the graphic series in the Graphic tools window, Series tab.

Lines of text
The introduction of lines of text into graphs was described previously under the Text button. The labels, the graph title and the axis legends are all editable objects whose options are presented with the right mouse button once they are selected (you can Copy, Paste or Hide the text, change its font, orientation, color or background, or edit the selected text). The axis legends and the graph title have default positions that are cancelled when these labels are moved; to recover those positions there is a Reset option in the submenu. The numerical labels of the scale cannot be changed or moved. To change the default font of the text lines, see the Text button.
In addition to the specific options of the Graphs window, there is a set of actions that can be performed directly on the created graphs. An InfoStat graph is a collection of objects, interrelated but each one able to be modified independently.


Some of these objects, such as the X and Y axes, have their options in the Graphic tools window; others can be modified directly from the graph. For example, the points that form a scatter diagram can be modified in general from the Series tab of the Graphic tools window, but they can also be modified individually from the graph. If one of these points is touched with the left mouse button, the number of the case it represents in the source file is shown beside it. If a point is touched with the right mouse button, a color bar is displayed that allows changing the color of that particular point.
InfoStat can produce the following graphs, accessed through the Graphics menu of the InfoStat main window: scatter plot, dot plot, bar chart, box plot, dot density plot, Q-Q plot, empirical distribution function, histogram, multivariate profiles, star plot, pie chart, stacked bar chart and scatter plot matrix. It also has a function plotter and surface and contour plots.
As in the Statistics menu, the Graphics menu of InfoStat presents two dialog boxes. The first (variable selector) serves to establish the variables that will be used to build the graph, to define partitions or to specify some attributes such as the size and labels of the graphic elements. The second (options window) is used to adjust various characteristics of the different graph types and to indicate whether the graphs produced by partitions appear in the same graph or in separate graphs. This window may not appear if the graph does not require the specification of options.
The different graph types are presented below. Due to the wide variety of choices and combinations that InfoStat provides for graphical displays, the description of each type initially presents the general characteristics of the graph and then exemplifies various options and details to change its appearance.

Scatterplot
The scatter diagram is a classical graph showing a set of points placed in the plane according to their X and Y coordinates. It is used when you want to visualize the joint variation of two quantitative variables. Only one variable can be represented on the X axis, but several variables can be plotted simultaneously on the Y axis.
The following scatter diagrams show the relationship between germination percentage (GP) and percentage of normal seedlings (NS) found in the Atriplex file.

Figure 43: Scatter plot of Normal seedlings versus Germination. File Atriplex.

Figure 44: Scatter plot of Normal seedlings versus Germination by seed size. File Atriplex.

Figure 43 shows an elemental scatter plot. This figure was obtained by assigning the
variable selection window Normal seedling" the Y axis and Germination to the X axis.
Figure 44 also was made by a scatter plot of Normal seedlings versus "Germination", but
using the "size" of the seeds as partitioning criteria. The charts are generated for different
seed sizes were placed in the same figure, indicating, in the dialog box that the partitions are
included in the same graph. To identify the different seed sizes were used with different
color points (select the Series and the right button select color) and different symbols (select
the series and select Symbol right). The size of the graphic elements increased to 10 points.
It was included a cut on the Y axis 45 corresponding normal seedlings. It can be possible
add a legend (on the graph press right button and select Show legend) to associate the
symbols and colors with sizes of seeds (partitions). This legend was placed in position>
free, and thus it could take in the graph. In addition, the legend by pressing the right mouse
button on the menu that appears Contorn, Invisible was chosen.
When you have a scatter plot like the one presented in Figure 43, you can fit a line or curve
that summarizes the relationship observed between the variables. Selecting one (or several) of
the series and pressing the right mouse button, InfoStat can be asked to apply a smoothing.
This option generates a new smoothed series for each of the selected series (Figure 45). By
default InfoStat applies locally weighted regression smoothing (LOWESS) (Cleveland, 1979).
When a smoothed series is selected, options appear in the graphical tools window that let you
specify different parameters for the smoothing and update the figure accordingly. In Figure 46
the LOWESS smoothing has been replaced by the fit of a straight line (polynomial of order 1).
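InfoStat performs the smoothing internally; the following Python sketch (an illustration only,
not InfoStat code, with simulated data standing in for the Atriplex file and an assumed span of 2/3)
shows what a LOWESS fit of normal seedlings on germination looks like with the statsmodels
implementation of Cleveland's (1979) estimator.

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
germination = rng.uniform(10, 100, 60)                       # stand-in for the observed X values
normal_seedlings = 0.8 * germination + rng.normal(0, 5, 60)  # stand-in for the observed Y values

# frac is the fraction of points used in each local regression (the smoothing parameter)
smoothed = lowess(normal_seedlings, germination, frac=2/3)   # returns sorted (x, fitted y) pairs

for x, y_hat in smoothed[:5]:
    print(f"Germination={x:6.2f}  smoothed Normal seedlings={y_hat:6.2f}")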

Figure 45: Scatter plot of Normal seedlings versus Germination by seed size, smoothed with LOWESS. File Atriplex.
Figure 46: Scatter plot of Normal seedlings versus Germination by seed size, smoothed with a first-order polynomial. File Atriplex.

Dot Chart
This chart type is similar to the scatter diagram, but it is intended to show a summary measure
of a set of data rather than individual values. Most commonly it represents mean values of a
variable "Y" in relation to a quantitative or categorical variable "X". If a quantitative variable
is assigned to the X axis, InfoStat gives the option of treating it as categorical, in which case
each different value is considered a new category (the default).
These graphs, like bar charts, may have associated line segments that represent measures of
variability (by default, the standard error of the mean). Below is a dot plot for the diameter of
the body of a nematode ("Diamcpo") growing at different temperatures ("Temperature"). We used
the Females file (courtesy of Dr. M. Doucet, Faculty of Agricultural Sciences, UNC), entering the
Body_Diam variable as the Variable to plot and the Temperature variable as the class criterion.
The points representing average values can be "connected" by selecting the series you want to
connect and then, with the right mouse button, activating Connectors>Visible. The connectors
give an idea of the level of the variable at intermediate temperatures, as shown in the
following figures:

Figure 47: Dot plot showing connecting lines for Body diameter versus Temperature. File Female.
Figure 48: Dot plot for Body diameter versus Temperature. File Female.

Confidence or prediction bands, whether parametric or nonparametric, can be added to these dot
plots. These bands are obtained by joining the endpoints of the bars that represent the errors.
This effect is achieved with the Surrounders: select the series you want to wrap and then, with
the right mouse button, choose Surround>Visible (Figure 49). Since the points and errors can be
made invisible (with the series selected, use the right-click menu options Main>Empty and No
Border, and Errors>Invisible>Both), a confidence (or prediction) band free of other graphic
elements can be obtained (Figure 50).

Figure 49: Dot plot showing envelope and standard errors for Body diameter versus Temperature. File Female.
Figure 50: Dot plot showing envelope for Body diameter versus Temperature. File Female.

Bar chart

This diagram represents mean values (or, optionally, medians, absolute frequencies, relative
frequencies, minima or maxima) of one or more variables in relation to one or more
classification variables. To the representation of the average values you can add a measure
of variability, which may be the sample standard error of the mean (default), the sample
standard deviation, a confidence interval for the mean at a given confidence level, a parametric
or nonparametric prediction interval, or an arbitrary measure of variation. The figures below
represent average values of germination percentage for seeds of different sizes, using the
Atriplex data file.

Figure 51: Bar chart (with diamond symbol) of Germination by size. File Atriplex.
Figure 52: Bar chart of Germination by size. File Atriplex.

If you specify one or more partition criteria, InfoStat gives the option of generating a graph
for each partition or putting all the partitions on the same graph. The following example shows,
for the Atriplex data file, the germination percentage of small, medium and large seeds according
to the color of the seed coat. For this, a bar chart of the germination percentage (selected as
variable Y) related to seed size (declared as variable X) was made, and the coat color was
specified as the partition (Figure 53).

Figure 53: Bar chart of germination for different sizes, by color. File Atriplex.
Figure 54: Bar chart of germination, DW, and Normal seedlings for different sizes. File Atriplex.

Just as one variable can be represented under various conditions and according to various
partition criteria on the same graph, it is possible to display multiple variables simultaneously
on one graph. Since the variables may be measured on scales that are not compatible, InfoStat
can add as many axes as the graph has series. Figure 54 represents the average values of
germination percentage, percentage of normal seedlings and dry weight of seedlings from
seeds of different sizes.
On the Y-Axis tab there is the Base field, where you can enter a reference value to make bar
graphs that project above and below that value. When a base value is given, all response values
that exceed the proposed base value for each value on the X axis are plotted above it, while those
that do not exceed it are plotted below, representing the difference with respect to the base
value. The edit field of the base is protected, and changes will not take effect while the
protection is activated. To remove the protection, clear the checkbox to the left of the field,
click with the cursor positioned on it, and then enter the value for the base. Using the Atriplex
file, Figure 55 shows a bar graph for normal seedlings in relation to the color of the seed
episperm with the base set to the minimum value, in this case 10, and Figure 56 a similar one
with base=40. In Figure 55 three cut lines are marked on the Y axis, at 62, 43 and 22, to
represent the average number of normal seedlings for each episperm color. Note that in Figure 56
the difference with respect to the value 40 is plotted.

Figure 55: Bar chart for Normal seedlings by seed Color, with base line equal to the minimum value. File Atriplex.
Figure 56: Bar chart for Normal seedlings by seed Color, with base line equal to 40. File Atriplex.

Box plot
Bar and dot charts summarize the information of a sample with a single point, eventually
accompanied by a measure of error or variability. However, it is difficult to visualize from them
the shape of the frequency distribution of each set of observations. The box plot aims to better
reflect the shape of these distributions, presenting in the same graph information about the
median, the mean, the 0.05, 0.25, 0.75 and 0.95 quantiles, and showing the presence, if any, of
extreme values. The specification of the variables in the variable selector for this type of
chart is identical to that of the dot plot.
Figure 57 shows the box plot for the variable body diameter ("Body_Diam") of nematodes grown at
different temperatures (Females file). For temperatures of 21 and 28 C, the elements represented
by each box plot are labeled on the graph. The labels were added with the Text tool from the
Graphics tools window.
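The summary values drawn by a box plot can be verified outside InfoStat; a minimal Python sketch
(with simulated data, not the Females file) of the quantities listed above:

import numpy as np

# Quantities summarized by the box plot described above: mean, median and the
# 0.05, 0.25, 0.75 and 0.95 quantiles (data are simulated).
rng = np.random.default_rng(7)
body_diam = rng.normal(22, 1.2, 80)

summary = {
    "mean": body_diam.mean(),
    "median": np.median(body_diam),
    **{f"q{q:.2f}": np.quantile(body_diam, q) for q in (0.05, 0.25, 0.75, 0.95)},
}
for name, value in summary.items():
    print(f"{name:>6}: {value:6.2f}")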

Figure 57: Box plot for Body_Diam by Temperature. File Female.
Figure 58: Box plot for Tail and Body_Diam by Temperature. File Female.

As in the bar chart, several variables can be included in a single graph, each (if necessary)
with its own Y axis, and/or multiple partitions. In Figure 58 the body diameter ("Body_Diam")
and the tail length ("Tail") of the nematodes in this example are plotted simultaneously.
When a series of a box plot is selected, the Series tab of the graphics tools window presents a
set of options to include a symbol representing the sample mean (enabled by default) and to
specify criteria for displaying extreme values.

Dot Density Plot


Although box plots provide important information on the shape of the frequency distribution, it
is sometimes very useful to see directly where the observed cases lie, especially if their number
is small. The specification of the variables for this graph is identical to that of the dot plot.
Figure 59 shows a dot density diagram for the germination percentage of seeds of different sizes
(Atriplex file). This chart includes cut lines at germination percentages of 10, 50 and 90%,
entered as a list in the Graphics tools window, Y axis tab, cut lines panel. For the 50% cut line
a thicker line was used; to do this, select the cut line with the mouse and, in the menu displayed
with the right button, select the thickness. If you want to change its color, choose Color from
the same menu.

Figure 59: Dot density plot for Germination. File Atriplex.

Q-Q plot
These graphs are used to evaluate the degree of fit of a set of observations to a theoretical
distribution. Although they are not formal goodness-of-fit tests, experience has shown them to be
effective in detecting departures from the distribution that formal tests are often unable to detect.
The parameters of the selected theoretical distribution are estimated from the sample. The
options window of the Q-Q plot lets you select among various theoretical distributions: Normal,
Chi square, Exponential, Weibull, Gumbel and Beta. If you wish, the line Y=X can be represented
on the graph by activating the Show line Y=X option.
The following is an example of a normal Q-Q plot applied to a sample of 50 observations from a
normal distribution (Qqplot file). The data correspond to the lengths of the petals of 50 flowers.
The line Y=X has been added to this graph.
Figure 60: Q-Q plot for 50 observations of Petal_length (n=50, r=0.991). File Qqplot.

In the Q-Q plot, the sample size (n) and the linear correlation coefficient r between the
observed quantiles and the quantiles of the selected theoretical distribution are shown at the
top of the graph. The X axis label shows the parameters of the theoretical distribution
estimated from the sample by maximum likelihood.
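As a sketch of the computation behind this display (not InfoStat's code; the data are simulated
and the plotting positions are an assumption borrowed from the empirical distribution formula
used elsewhere in this manual), the following Python fragment builds the pairs plotted in a
normal Q-Q plot and the correlation coefficient r shown on the graph.

import numpy as np
from scipy import stats

petal_length = np.sort(stats.norm.rvs(loc=5.0, scale=0.35, size=50, random_state=1))

mu_hat, sigma_hat = stats.norm.fit(petal_length)           # maximum-likelihood estimates
n = petal_length.size
pp = (np.arange(1, n + 1) - 0.375) / (n + 0.25)            # assumed plotting positions
theoretical = stats.norm.ppf(pp, loc=mu_hat, scale=sigma_hat)

r = np.corrcoef(theoretical, petal_length)[0, 1]           # the r reported on the chart
print(f"n={n}  r={r:.3f}  mu={mu_hat:.3f}  sigma={sigma_hat:.3f}")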

Empirical distribution function


Although the cumulative relative frequency polygon (available as an option of the Histogram
submenu) can be used to visualize the shape of the empirical distribution function of a set of
observations, that technique usually requires a large number of data to give a good approximation
of the true distribution.
There are many techniques for building empirical distribution graphs of a variable. The graph
that InfoStat generates is based on calculating the distribution according to the following
algorithm: let x(1), x(2), ..., x(n) be the observations of a sample of size n ordered from
lowest to highest. The empirical distribution function evaluated at the observation x(i) is
calculated as F(x(i)) = (i-0.375)/(n+0.25).
This graph shows the observed values of the variable on the X axis and the empirical
distribution function evaluated at each observed point on the Y axis.
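The calculation is simple enough to reproduce outside InfoStat; a minimal Python sketch of the
algorithm above (the data are arbitrary, not the Iris or Garlic files):

import numpy as np

def empirical_cdf(values):
    """Return the ordered observations and F(x_(i)) = (i - 0.375) / (n + 0.25)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    f = (ranks - 0.375) / (n + 0.25)
    return x, f

x, f = empirical_cdf([2.1, 3.4, 1.7, 4.0, 2.9])
for xi, fi in zip(x, f):
    print(f"x={xi:4.1f}  F(x)={fi:5.3f}")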
Two examples of empirical distribution graphs follow. Figure 61 shows the graph of the empirical
distribution of petal length in the pooled sample that includes three species of Iris (setosa,
versicolor and virginica; Iris file). The graph has the grid enabled. Petal length is one of the
variables that best distinguishes these species. The empirical distribution function for the
pooled sample shows a strong anomaly with respect to a normal distribution function, indicating
a possible mixture of distributions, since the plotted data may come from three different normal
distributions, each associated with a species.
Figure 62 corresponds to the perimeter of white garlic heads from the 1998 campaign, in the
Garlic file (courtesy of Dr. V. Conci, IFFIVE-INTA). In the latter figure two cut lines have been
added, making it easy to see that the 50% percentile corresponds to the value 15. Given the large
number of cases, the point size was reduced to 2. Eventually the points could be made to disappear
by reducing their size to 0, and the shape of the distribution outlined by making the connectors
visible. Optionally you could ask InfoStat to generate a smoothed series for these charts.

Figure 61: Empirical distribution function for Petal_length. File Iris.
Figure 62: Empirical distribution function for Perimeter. File Garlic.

Histogram
InfoStat can build frequency histograms when enough observations are available. These histograms
can be used to approximate the underlying theoretical distribution. Practical experience shows
that a wide range of distributions can be approximated well from an empirical distribution built
from 50 or more observations. When a histogram is constructed, the Graphics tools window displays
a dialog that lets you modify the attributes of the histogram obtained. In the Series tab of this
window there is a menu of histogram options that allows you to: change the number of classes
(Classes), which by default is calculated as log2(n+1); fit (Adjustment) a Normal, Chi square,
Exponential, Weibull, Gumbel, Beta or Gamma distribution with parameters equal to the sample
estimates obtained by maximum likelihood (if you do not wish an adjustment, the field must contain
the word None). In addition, the window lets you choose the frequency represented in the histogram
(Freq). The frequency plotted can be: relative frequency (Freq rel.), which is the default,
absolute frequency (Freq abs.), cumulative absolute frequency (Freq abs. cumulative), or
cumulative relative frequency (Freq rel. cumulative). The Contorn field eliminates the contours
of the bars that form the histogram. A frequency polygon can be constructed by selecting the
Polygon field. The Main field eliminates the background histogram from which the polygon was
drawn. The FCLL and LCLU fields allow you to enter a lower limit for the first class and an upper
limit for the last class, respectively. To make the tick marks correspond to the class marks of
each interval, activate M. classes in the Series tab.
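As a rough sketch of these defaults (assumed, not InfoStat's source code, and using simulated data
rather than the Garlic file), the default class count log2(n+1), the relative frequencies and a
maximum-likelihood normal fit could be computed as follows in Python.

import numpy as np
from scipy import stats

perimeter = stats.norm.rvs(loc=17, scale=4, size=200, random_state=3)   # simulated sample

n_classes = int(round(np.log2(perimeter.size + 1)))        # the manual's default rule
counts, edges = np.histogram(perimeter, bins=n_classes)
rel_freq = counts / counts.sum()                           # "Freq rel.", the default

mu_hat, sigma_hat = stats.norm.fit(perimeter)              # ML estimates for the fitted density
print(f"classes={n_classes}  mu={mu_hat:.2f}  sigma={sigma_hat:.2f}")
print("relative frequencies:", np.round(rel_freq, 3))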


Below are the relative frequency histogram (Figure 63) and the cumulative relative frequency
histogram (Figure 64) for the perimeter of white garlic heads from a plant protection trial
(Garlic file). In Figure 64 cut lines have been added on the Y axis at the value Y=0.50 and on
the X axis at the value X=17. In both graphs the double-row technique was used for the X-axis
scale so that the scale labels do not overlap.

Figure 63: Relative frequency histogram and polygon for Perimeter. File Garlic.
Figure 64: Cumulative relative frequency histogram and polygon for Perimeter. File Garlic.

Multivariate profile chart


When you have repeated measures of a variable over time, or several variables measured on the
same individual or experimental unit, it may be of interest to visualize the response profiles.
For each variable, or for a variable at different times, InfoStat can plot a point or bar
representing the mean value, a box plot summarizing the shape of the distribution, or a dot
density plot. This graphing technique requires that the variables, or the measurements at each
time, whose profile you want to plot are in different columns of the data table, as shown in the
next window, which corresponds to the Prosopis file data.


Example 20: In Prosopis sp., plant height was recorded on 9 occasions from sowing until day 498.
The origin of the seeds was used as the class criterion. There are 23 sources and for each one
there are six records. The data are in the Prosopis file (courtesy of Eng. G. Verzino, Faculty
of Agricultural Sciences, UNC).


To generate a multivariate profile, assign all the variables in the profile to the list of
variables in the variable selection window. You can specify a classification criterion if you
want to see more than one profile on the same graph; in that case there will be as many profiles
as groups obtained from the classification criterion. By default InfoStat proposes a diagram of
connected points. If the profiles shown are the result of averaging the response of several
repetitions, then it may make sense to make the error bars visible by selecting the corresponding
series and, with the right button, choosing Errors>Visible>Both. Below is a diagram of
multivariate profiles for the height of carob plants. Figure 65 shows the profiles for all
sources, and Figure 66 shows only 3 origins (profiles), to which error bars representing the
standard error of the mean were added.

Figure 65: Multivariate profile (dot plot) for Prosopis tree height from different sources of seeds, measured at 8 times. File Prosopis.
Figure 66: Averaged multivariate profile (dot plot) for Prosopis tree height from different sources of seeds, measured at 8 times. File Prosopis.
The following figure shows the evolution of height, without discriminating between sources, using
box plots as the graphic elements of the profile, joined by making the connectors visible.
Figure 67: Averaged multivariate profile (box plot) for Prosopis tree height from different sources of seeds, measured at 8 times. File Prosopis.

Star Plot
These graphs are used to represent multivariate observations in a concise and comparative way.
Each variable is represented as a radius of a star; the magnitude of the radius is given by the
value of that variable for the observation represented by the star. If several observations are
represented in a single star, i.e. several observations in the file have the same value of the
classification criterion, the radius is a function of the average value of each variable. Thus,
different shapes of stars indicate the variables that differ most between the observations.
Consider the Iris file, containing the values of four variables observed in flowers of 50
individuals from each of three species of Iris. Using star charts of these four variables with
the species as the classification criterion, we obtain Figure 68. To achieve this figure,
activate the Star with endings option. Figure 69 is similar to the previous one except that Star
closed has also been enabled. In both cases the IDs>Visible>Labels option was activated for the
three series using the right mouse button. Other changes consist of modifying the colors, the
symbols and the location of the stars.
The figures show that the variable "PetalWid" is more prominent in species 3, while "SepalWid"
is more prominent in species 1. Moreover, "PetalLen" and "SepalLen" are comparatively smaller in
species 1 than in species 2 and 3.

Figure 68: Star plot using the four response variables in file Iris.
Figure 69: Star plot using the four response variables in file Iris and the option Closed stars.

If there is no variable in the file to be used as a classification criterion, the user can obtain
a "simple star" to represent the relative magnitudes of the selected variables. This type of star
makes sense in the case of commensurable variables, i.e. variables expressed on the same scale;
in that case the comparison of the lengths of the radii allows differences between variables to
be inferred. When more than one star is graphed (using one or more variables as criteria), the
comparison must be made between corresponding radii of the different stars and not between the
radii of the same star.

Pie plot
The pie chart is useful for representing percentage contributions to a total frequency, or the
distribution of a categorical variable. For example, if household expenses are divided into food,
services and taxes, education and others, then a pie chart shows the proportional contribution of
each of these items to total spending, as shown in Figure 70.
Figure 70: Pie graph showing the distribution of family expenses (Alimentation 49%, Education 20%, Others 16%, Taxes-Services 15%).


To construct this graph, InfoStat offers two options depending on how the information is
arranged within the data table.


Columns Categories: the data are presented with the various contributions in the columns of the
table, as illustrated in the following window. In the variable selection window for this graph,
the columns of the table identifying the contributions are assigned to the Classes list (the
factors of the pie).

If the data table has more than one row, InfoStat uses the column totals to compute the
proportional contribution of each item. If, in addition, there is a grouping criterion, several
pies can be placed on the same graph, one for each group. Likewise, if there are multiple rows of
data for each group, the column totals within each group are used to build the corresponding pie.
For example, if a column "Family" is added to the above table indicating whether the family lives
in an urban or rural environment, and "Family" is chosen as a grouping criterion in the variable
selection window, a graph is produced in which the distribution of spending appears separately
according to whether the family lives in an "urban" or "rural" area (Pie file).

The series in a pie chart represent the various sectors. When several pies are plotted on one
graph, an additional series identifies the groups; it is not displayed in the chart but it is
responsible for the subtitles that identify each pie. This series should be at the start (default)
or at the bottom of the list of series so that the title of each pie is located at the top. If
this series is made invisible, the identifiers of each pie disappear.


Figure 71: Pie graph for expenses in Urban and No-urban families. File Pie.
By selecting the remaining series it is possible to apply several modifications to the basic
diagram that InfoStat generates. In the options panel you can specify: the overall size of the
pies (this applies to all series, whether selected or not), the separation between sectors (in
the example the "Education" sector was separated), whether a 3D effect is wanted for one or more
sectors, the relative position of the sectors, and their colors. You can also selectively change
the content of the labels associated with each sector, choosing among 7 different formats or no
lettering. Editing the name of the series changes the content of the labels.
When the number of pies in a graph is greater than one, InfoStat allows the user to indicate in
how many columns the set of pies should be presented. The Order tab tells the system which
ordering was used to make the chart. When these diagrams are generated using a classification
criterion, the diagram shows the various pies ordered arbitrarily. Selecting the pie you want to
move and using Ctrl., the pies can be ordered into the desired position.
The scales of the X and Y axes, as well as the axes themselves, are hidden by default because in
the context of pie charts they have no useful interpretation and represent only the scale
according to which the pies are arranged. These scales can be modified by the user if he/she
finds it convenient, changing the minimum and maximum to adjust details of the positioning of the
pies. An important fact is that if the user changes the number of columns or the size of the
pies, InfoStat automatically recalculates the scale of these axes, returning them to their
default values.
On the other hand, if the Ring option is turned on in the example shown, the following graph is
obtained.

Figure 72: Pie graph for expenses (with ring option activated) in Urban and No-urban families. File Pie.
In some situations the label associated with one or more sectors does not change when a new
lettering option is applied. This may be due to two causes: the corresponding sector is selected,
or the label has been edited. In the latter case, select the label, press the right mouse button
and choose the Reset item; the label then returns to its original position and can be modified
according to the lettering choice. Editing or moving a label also affects its ability to "follow"
the sector if the sector changes its relative position within the pie; this situation is also
corrected by selecting the label, pressing the right mouse button and using the Reset item. If
you change the name of a sector by editing the corresponding series in the Series tab, then
running the Reset option only undoes the change of position and the edits made on the label,
keeping the name of the series.
As practical advice, it is recommended to make all changes of relative positioning of sectors,
separation, ordering in columns and lettering options before making minor adjustments to the
positioning of the labels within the image. That way unwanted effects are avoided.
The Categories in rows option for plotting pies assumes that the data identifying the different
categories are in a column. To build the sectors of the pie, InfoStat calculates the relative
frequencies of each category, either by counting the number of occurrences of each one or by
using frequencies that can optionally be declared. In this mode you can also obtain pies for
subsets of data defined by grouping criteria. Once the pie is obtained, all the options discussed
above are applicable.

Stack bar plot


The stacked bar chart applies when one wants to represent comparatively the contributions that
different components make to a total. For example, if the weight of a plant is divided into the
weight of roots, stems and leaves, then the contribution of each of these partitions to the total
weight can be represented as segments of a bar whose height is 1. Then, if the weight of the
leaves is the largest contributor to the total weight, the segment associated with the weight of
the leaves will be the largest. By default the stacked bar chart presents the contribution of
each partition as a proportion of the total, but it is possible to request the representation of
the absolute values of the contribution of each partition; in that case the height of the bars
is variable.
The series that make up the various segments of the stacked bar can have their relative positions
changed and can be made invisible; in that case the proportions are recalculated. Eventually,
these segments can be highlighted by making them narrower or wider, changing the size of the
graphic elements. Visible connectors reflect the profile of variation of each segment of the bar.
Figure 73 shows the proportional contribution to total net earnings, for each year, of the
headquarters and four branches of an agricultural business (Profits file). The example shows how
the total net profit per year was formed from the partial contributions of the parent company and
the branches. Figure 74 shows the same graph but representing the absolute values of these
contributions.
Figure 73: Stack bar plot of profits by year for headquarters and four branches, expressed in proportions. File Profits.

Figure 74: Stack bar plot of profits by year for headquarters and four branches. File Profits.

Scatter plot matrix


This option produces, in one chart, a matrix of scatter plots. It is useful for visualizing the
relationships among a set of variables. Figure 75 shows this form of representation for the
relationships between germination ("PG"), normal seedlings ("PN") and seedling dry weight ("DS")
for a germination test (Atriplex file). Figure 76 shows the same chart, to which smoothings based
on the locally weighted regression technique (LOWESS) have been added.

Figure 75: Scatter plot matrix for Germination, Normal seedlings and dry weight (DW). File Atriplex.
Figure 76: Scatter plot matrix for Germination, Normal seedlings and dry weight (DW) with a smoothing. File Atriplex.

Function plotter
The function plotter is a graphical tool of InfoStat for graphing user-specified functions of one
variable. The user can specify one or several functions simultaneously. In the dialog box shown
below, two functions of the variable x have been specified: the natural logarithm (ln(x)) and
cos(x). The upper and lower limits are the bounds within which the function or functions are
graphed. N points is the number of points into which the plotting interval is divided.

According to the specifications in the previous screen, the following graph is obtained:


Figure 77: Graph obtained for two functions of the variable x, natural logarithm (ln (x)) and
cos (x).

Applications
Quality control
In this version InfoStat provides the control diagrams most commonly used in the quality control
of the production of goods and services. The full version of the Quality Control application of
InfoStat offers a complete menu of actions for statistical quality control. This manual only
documents the techniques offered in this release.
The quality of a product or service is defined by its fitness for the use demanded by the market.
Production processes can be controlled from one or more measurements of quality characteristics.
The quality parameters or characteristics are those attributes or variables of the product that
describe its fitness. A key concept is the variability or dispersion (differences among the
values of a set of measurements) of these measurements. The random component of each measurement
is assumed to arise from the addition of random components from various sources of variation
and/or error. The total variation of a set of measurements can thus be decomposed into a sum of
variation measures due to the sources that affect the process. It is important to distinguish
between common causes and special causes of variation. Common causes can arise from numerous
factors that, while affecting the distribution of the measured characteristic, do not prevent the
system from being predictable. Special sources of variation generate uncontrolled, unstable and
unpredictable variation. The statistical tools for quality control are intended to reduce the
variability of the quality parameters through analysis of the process and comparison with
established standards, providing information useful for designing actions to correct problems
caused by special sources of variation. In the framework of statistical quality control, a system
is called stable when the quality characteristics show only small variations (natural variability
or common causes). A process that operates with only chance causes of variability is considered
to be in statistical control. A measurement system is then said to be under statistical control
when the variation in the measurements is due only to common causes and not to special causes;
in this case the variability of the measurement system is small compared to the variability of
the process and/or the specification or tolerance limits. The sources of variability that are not
part of the chance causes are known as assignable causes. A process that operates in the presence
of assignable causes is considered out of control (Montgomery, 1991). Statistical process control
aims to identify and track changes between these two states of the process. Processes are
regularly assessed to verify that they operate according to requirements. The evaluation of the
statistical properties of the process is documented in observation log sheets that record all the
features of the measurement system (measured variable, operations, measurement equipment,
personnel, sample size, standards, limits, etc.), the data gathered, the diagrams obtained and
the conclusions drawn from them.
For process control, this version of InfoStat allows control charts to be obtained for both
discrete and continuous attributes. Other statistical techniques useful for modeling the
relationships between the variables that determine the output of a production process and the
quality of the product can be easily implemented from basic modules of InfoStat, such as linear
and nonlinear regression, analysis of variance (to study causes that affect the system),
multivariate analysis and time series. In quality control, the process under study can be
considered as the generating process of a population of one or more values of random variables
about which one wants to make inferences on the basis of the information contained in a sample
drawn from that population. Therefore, the descriptive statistics tools provided in the InfoStat
summary measures and frequency tables submenus can be used to characterize a sample of a process.
In these submenus, from a sample of records of a characteristic of interest, you can automatically
construct frequency histograms and frequency tables and calculate means, variances, ranges,
percentiles, etc. When there is sufficient knowledge to propose probability distribution models
to describe the process under study (the parameters of the process or the distribution of the
random variable under study are known), the user of InfoStat can, through the Probabilities and
Quantiles submenus, answer, from statistical theory, questions of interest in quality control such
as: what is the probability that a product taken at random from the production line contains at
least one defect if the process works as expected? When the process parameters are not known, the
user can resort to statistical techniques to infer them. InfoStat estimates parameter values
using confidence intervals, tests hypotheses about expected values and variances for the case of
one sample, and allows the comparison of parameters of a process when more than one sample is
available.
The diagrams and control charts offered in the Quality Control application allow estimation of
the parameters and standards that govern a process under control from preliminary samples,
on-line control of products by monitoring the process using known standards, and estimation of
process capability. A control chart is a graph in which the values of the studied quality
characteristic are arranged on the Y axis for different samples, or for the same sample at
different time points, identified on the X axis. Three lines (parameters of the diagram) accompany
the plotted series: the center line (drawn at the average value of the series for an in-control
state) and the lines for the lower and upper control limits (the range within which almost all the
observations of a process under control are expected to fall). Points outside the region
determined by these limits suggest that the process is not under control. Even if the values of
the observed series lie between these limits, the process can be questioned if the points do not
show a random distribution pattern around the center line; values systematically higher or lower
than expected suggest a process out of control.
The limits are derived from the expression limit = mean ± k*standard error, where k is an integer.
When k=3 the intervals are known as three-sigma intervals, the Greek letter sigma usually
representing the standard deviation of the mean (standard error) of the characteristic. The width
between the control limits depends on the variability of the quality characteristic, the sample
size used to construct the limits and the reliability required (3 sigma, 6 sigma, etc.). The
charts built in this way are usually called Shewhart charts.
At other times the center line and control limits are externally determined, for example chosen
by the administrator of the process according to established standards, or calculated from
preliminary samples recorded not to control the process but to set such limits.


When using control charts, consider the observations that fall outside the limits as well as the
trend of the set of observations and the points that fall on the center line. Instability of the
process can be identified by any of these conditions.
Two types of errors can be committed when making a decision on the state of the process from a
control chart: 1) type I error, when it is concluded that the process is out of control when in
fact it is under control; 2) type II error, when it is accepted that the process is under control
when in fact it is out of control. The probability of making a type I error is set by the user
when constructing the diagram if probabilistic limits are used. InfoStat does not calculate
probability limits, only limits of the k-sigma type. With k-sigma limits this probability can be
approximated if the underlying distribution of the statistic used to construct the control limits
is known. For a normal distribution of the random components, 3-sigma limits for the average
quality of the process produce a type I error risk of 0.0027. The probability of making a type II
error can only be calculated if the distribution of the process when it is not under control is
known. Operating characteristic curves allow visualization of the type II error probability under
various states of the process and indicate the power of the diagram to identify changes of
different magnitudes (Montgomery, 1991). The size of the type II error can be read from operating
characteristic curves, which are available in the full version of the application.
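The 0.0027 figure quoted above for 3-sigma limits under normality can be checked directly; a
minimal verification in Python (scipy is used here only for the normal CDF, as an assumption of
this sketch):

from scipy.stats import norm

# Two-sided probability of falling outside mean +/- 3 sigma under a normal distribution
alpha = 2 * (1 - norm.cdf(3))
print(round(alpha, 4))          # 0.0027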
The menu APPLICATIONS QUALITY CONTROL lets you select: 1) Attribute control charts: Failure
proportion (p), Number of failures (np), Number of failures per unit (c), Average failures per
unit (u); 2) Variable control charts: Mean and range (X-mean, R), Mean and standard deviation
(X-mean, S), Individual values; 3) Modified control charts: Cumulative sum (CUSUM), Run sum, and
Pareto chart; and 4) Process capacity (cp and cpk).
In the diagrams and control charts for variables or attributes, the user can exclude subsets of
data from the analysis as many times as necessary to obtain the diagram for the process under
control. When control charts are established from preliminary data and standards are not
available, it is often of interest to use the parameters of the diagram for on-line process
control or for the control of future observations. For all types of diagrams produced, InfoStat
allows activating the Known control diagram parameters box and entering known values of the
parameters of the diagram (e.g. those estimated from a preliminary sample) to obtain a graph built
from the values in the file but with the entered center line and control limits.
The process capability analysis allows one to investigate whether a process is running within
specifications. This analysis is based on the distance between the observed results and the
nominal or expected values under a normal distribution. The user of InfoStat can estimate process
capability when applying control charts for variables.

Control chart for attributes


The menu APPLICATIONS QUALITY CONTROL ATTRIBUTE CONTROL CHART produces control charts for quality
characteristics that are not measured on a quantitative scale. These diagrams are useful in
situations where each product inspected is classified as conforming (no defects) or nonconforming
(defective) with respect to the quality specifications. When the quality characteristics are
discrete, as in this case, they are called attributes. Under this option, InfoStat produces
diagrams for the proportion or percentage of nonconforming products resulting from a production
process (Failure proportion (p)), for the number of nonconforming units produced (Number of
failures (np)) and a control diagram of nonconformities per unit (Number of failures per unit (c)).

The proportion of nonconforming units, denoted by p, is the ratio between the number of defective
products and the total produced. A nonconforming article is one that fails to meet at least one
of the quality characteristics evaluated simultaneously. If the proportion of non-defective
products is analyzed instead of the proportion of nonconforming ones, a control chart for process
yield is obtained. The control chart for the proportion of defectives, or p chart, is a
Shewhart-type control diagram.
The diagram is based on the distribution of the proportion of defectives (Binomial distribution).
For a process under control, p is assumed to be the probability that an item is defective, i.e.
does not meet the specifications; independence among the items produced is also assumed. If the
random variable X represents the number of defectives in a sample of size n, X is distributed as
a Binomial with parameters n and p. The expected value (mean) of the number of defectives is np
and its variance is np(1-p). The sample estimate of p is p̂ = X/n, with mean p and variance
p(1-p)/n. The reference lines of the Shewhart diagram based on the theoretical proportion p
(reference proportion) are the upper and lower control limits (UCL and LCL, respectively) and are
constructed as follows:

UCL = p + k·sqrt(p(1-p)/n)
Central line = p
LCL = p - k·sqrt(p(1-p)/n)

When the value of the proportion of nonconforming units p of a process under control is unknown,
InfoStat plots the sample estimates of p obtained from successive samples of n units and builds
the reference lines from the observed values (estimates). Usually the nonconforming fraction of
the process under control is unknown; the lines displayed in the control diagram automatically
produced by InfoStat are calculated from the observations in the file and therefore correspond to
trial limits. These are obtained as follows:

UCL = p̄ + k·sqrt(p̄(1-p̄)/n)
Central line = p̄
LCL = p̄ - k·sqrt(p̄(1-p̄)/n)
where p̄ = (1/m)·Σ(i=1..m) p̂i is the average of the proportions of defectives across m samples of
size n, each with proportion of defectives denoted by p̂i. InfoStat requires that the user enter
the value of k for the construction of the limits; k is 3 by default.
If any sample falls outside the trial control limits, usually, after considering the possible
causes of this event, the sample is discarded and the trial limits are recalculated. This process
is repeated until all points lie within the limits, at which time the trial control limits are
accepted for current use (Montgomery, 1991). InfoStat's facilities for enabling and disabling
cases make it simple to adjust the limits in this way. Subsequently, by checking Known control
diagram parameters and entering the parameter values obtained from the preliminary samples, it is
possible to obtain a new graph from the values in the file but with the now-known center line and
control limits.
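A minimal sketch of this calculation in Python (not InfoStat's implementation; the counts of
defectives, n=200 and k=3 are assumptions) reproduces the trial limits p̄ ± k·sqrt(p̄(1-p̄)/n)
from m samples of equal size and flags the samples that fall outside them.

import numpy as np

defectives = np.array([12, 15, 8, 10, 4, 7, 16, 9, 14, 10, 5, 6, 17, 12, 22])  # assumed counts
n = 200                                   # sample size (assumed)
k = 3

p_i = defectives / n                      # proportion defective in each sample
p_bar = p_i.mean()                        # centre line
se = np.sqrt(p_bar * (1 - p_bar) / n)

ucl = p_bar + k * se
lcl = max(p_bar - k * se, 0.0)            # the lower limit cannot be negative
print(f"UCL={ucl:.4f}  CL={p_bar:.4f}  LCL={lcl:.4f}")
print("samples outside the trial limits:", np.where((p_i > ucl) | (p_i < lcl))[0])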
Example 21: In a production line of elastics for cars, 30 samples of size 200 each were taken and
the number of defectives per sample was recorded. The data are in the Diagram_P file, whose
columns are called "Size" and "Defectives#".
For the diagram, select the menu APPLICATIONS QUALITY CONTROL and choose Failure proportion (p).
A window called P Control Diagrams will appear, listing the variables of the file. The user must
select the variable containing the number of defectives (in this example, the column
"Defectives#") and enter it in the Failure number subwindow; the column of the file containing
the size of each sample (in this example, "Size") must be passed to the Sub-groups total
subwindow. If necessary, you can specify a variable that contains dates or that indexes the time
of extraction of the sample in some way, to be used on the X axis of the diagram, in the Time
(optional) option. By default the diagram assumes that the data appear in the chronological order
of sampling. After clicking OK, another screen called P Control Diagrams is displayed, in which
the user can modify the value of k. The following figure shows the diagram obtained for k=3. The
Output window lists the values of the upper and lower limits and the central line.
Table 68: Control limits for failure proportion. File Diagram_P.
Control limits: Failure proportion
  Upper line    0.1382
  Center line   0.0805
  Lower line    0.0228

271

Applications
P control diagram
0.1459

Failure proportion

0.1132

0.0805

0.0478

0.0151
1

15

22

29

Chronological order

Figure 78: Control p diagram. File Diagram_P.


If information on p is available, say a target value of p=0.05 is specified, the limits will be
UCL = 0.05 + 3·sqrt(0.05(1-0.05)/200) = 0.0962 and LCL = 0.0038. In the Control limits window,
checking Known control diagram parameters and entering this value, a graph can be obtained from
the values in the file but with the center line and control limits derived from the known
proportion p.
Sometimes the nonconforming fraction of each sample is calculated from different sample sizes.
InfoStat lets you enter the size of each sample, in which case the control limits are based on an
average sample size, calculated automatically; this produces approximate limits. Duncan (1974)
discusses the calculation of the sample size needed to have a high probability of detecting a
change of a specified amount in the process.
The np control chart, or Number of failures (np), is constructed similarly to the previous
diagram, but instead of plotting the nonconforming fraction directly, the number of nonconforming
units is plotted. The reference lines of the diagram obtained from the data are calculated as
follows:
UCL = n·p̄ + k·sqrt(n·p̄(1-p̄))
Central line = n·p̄
LCL = n·p̄ - k·sqrt(n·p̄(1-p̄))
The following figure shows the diagram of Number of failures (np) for the previous
example (Diagram_P file).
Figure 79: np control diagram. File Diagram_P.


Another diagram that can be selected is the Number of failures per unit (c) control chart. It is
useful in situations where the sample consists of a number of inspection units and the number of
nonconformities is recorded for each of them.
Example 22: A producer of video games takes 50 samples of n=10 videos each; for every video the
number of defects is recorded, yielding a set of 50 averages of nonconformities per inspection
unit (video). File Diagram_C.
If the column "Failures#" is entered in InfoStat and the Number of failures per unit (c) diagram
is requested, a graph is obtained in which the limits are calculated from the Poisson distribution
as follows:

UCL = ū + k·sqrt(ū)
Central line = ū
LCL = ū - k·sqrt(ū)

where ū is the average, across all samples, of the mean number of nonconformities per unit. If
the data file contains the number of nonconformities for each inspection unit, you must first
calculate, using the Summary statistics menu, the average number of defects per unit for each
sample; the resulting file of averages is then used to produce the c control chart.
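A sketch of the same limits in Python (the sample averages and k=3 are assumptions, not the
Diagram_C file):

import numpy as np

# Poisson-based limits for the average number of non-conformities per unit
failures_per_unit = np.array([3.2, 4.1, 2.8, 5.0, 3.6, 4.4, 2.9, 3.8])  # assumed sample averages
k = 3

u_bar = failures_per_unit.mean()
ucl = u_bar + k * np.sqrt(u_bar)
lcl = max(u_bar - k * np.sqrt(u_bar), 0.0)   # truncated at zero when negative
print(f"UCL={ucl:.4f}  CL={u_bar:.4f}  LCL={lcl:.4f}")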
To obtain it, activate the menu APPLICATIONS QUALITY CONTROL ATTRIBUTE CONTROL CHART and select
Number of failures per unit (c). A window called C Control Diagrams appears, listing the variables
of the file. The user must select the variable containing the average number of defects per unit
inspected (in this example, the column "Failures#") and enter it in the Number of cases per unit
subwindow. Optionally you can specify a variable that contains dates or that indexes the time of
extraction of the sample in some way, to be used on the X axis of the diagram, in the Time
(optional) subwindow. By default, the diagram assumes that the sampling data were entered in
chronological order. After clicking OK, another screen called C Control Diagram is displayed, in
which the user can modify the value of k. The following figure shows the diagram obtained for
k=3. The Output window lists the values of the trial limits and the central line.
Figure 80: c control diagram. File Diagram_C.


The graph shows that the process is not under control.

Variable control charts


The menu APPLICATIONS QUALITY CONTROL VARIABLE CONTROL CHART allows control charts to be obtained
for quality characteristics that are measured on a quantitative or numerical scale. These diagrams
are useful in situations where each inspected product provides a measurement of a quality
characteristic. In such situations the interest is in controlling the average value of the
characteristic and some measure of its variability. Within this option, InfoStat produces the
following diagrams: Mean and range (X-mean, R) and Mean and standard deviation (X-mean, S).
InfoStat builds the diagrams under the assumption of a normal distribution for the quality
characteristic under study; the resulting diagrams are approximately correct for non-normal
distributions when working with large samples.
The diagrams for the mean and range allow control of the process average (X-mean chart) and of
its variability through the range (R diagram). Together they provide information on the operating
capability of the process: while the X-mean chart is used to control the variability between
samples, i.e. the variability of the process over time, the R diagram monitors the variability
within a sample, i.e. an instantaneous picture of the process variability. Generally 20 to 30
preliminary samples, taken when the system is under control, are used to obtain the values of μ
and σ, the mean and standard deviation of the process, respectively. For example, if there are
m=20 samples of size n=5 each, the estimator of μ, which represents the center line of the
diagram, will be:
μ̂ = (X̄1 + ... + X̄20)/20
and the estimate of the range is calculated from the sample ranges as
R̄ = (R1 + ... + R20)/20
where Ri is the range, i.e. the difference between the maximum and minimum value, of the i-th
sample.
InfoStat builds the mean control chart (X-mean chart) using the following reference lines:
UCL = μ̂ + k·σ̂x̄
Central line = μ̂
LCL = μ̂ - k·σ̂x̄
where UCL and LCL are the upper and lower control limits, respectively, and σ̂x̄, the standard
error of the sample mean, is calculated from the range as σ̂x̄ = (1/sqrt(n))·(R̄/d2); the values of
d2 are those of the expectation of the relative range random variable (Montgomery, 1991).

InfoStat also represents the values of R for the successive samples in the range control chart
(R chart), using the following reference lines:
UCL = R̄ + k·σ̂R
Central line = R̄
LCL = R̄ - k·σ̂R
where the estimator σ̂R is also obtained from the distribution of the relative range. The standard
deviation of the relative range, denoted by d3, is a known function of n; values of d3 are
tabulated in Montgomery (1991) for different sample sizes. InfoStat calculates σ̂R from them as
σ̂R = d3·(R̄/d2).
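Under the definitions above, the limits for both charts can be computed directly from the sample
means and ranges; the following Python sketch (not InfoStat's code; the sample values are
assumptions, and d2=2.326, d3=0.864 are the tabulated constants for n=5 quoted later in this
section) illustrates the calculation.

import numpy as np

# X-bar and R chart limits from m samples of size n = 5, using tabulated d2 and d3
samples = np.array([[74.030, 74.002, 74.019, 73.992, 74.008],
                    [73.995, 73.992, 74.001, 74.011, 74.004],
                    [73.988, 74.024, 74.021, 74.005, 74.002]])   # a few assumed samples
k, d2, d3 = 3, 2.326, 0.864                                      # constants for n = 5

x_bar = samples.mean(axis=1)                                     # sample means
r = samples.max(axis=1) - samples.min(axis=1)                    # sample ranges
mu_hat, r_bar = x_bar.mean(), r.mean()
n = samples.shape[1]

sigma_xbar = r_bar / (d2 * np.sqrt(n))
print("X-bar chart:", mu_hat - k * sigma_xbar, mu_hat, mu_hat + k * sigma_xbar)

sigma_r = d3 * r_bar / d2
print("R chart:    ", max(r_bar - k * sigma_r, 0.0), r_bar, r_bar + k * sigma_r)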

Example 23: The Diagram_MR file contains diameter readings of forged piston rings for 25 samples
of 5 rings each (Montgomery, 1991).
For the diagram, select the menu APPLICATIONS QUALITY CONTROL VARIABLE CONTROL CHART and choose
Mean and range (X-mean, R). A window called Range control diagram (R) will appear, listing the
five variables of the file, each representing a sample observation. The user must select all the
columns that contain sample observations ("Obs1"-"Obs5") and enter them in the Observations per
sample subwindow. Optionally you can specify a variable that contains dates or that indexes the
time of extraction of the sample in some way, to be used on the X axis of the diagram. By default
the diagram assumes that the data appear in the chronological order of sampling. After clicking
OK, another screen called Range control diagram (R) is displayed, in which the user can modify
the value of k. The following figures show the diagrams (for the mean and for the range) obtained
for k=3. The Output window lists the mean and range for each sample and the values of the center
line and control limits of the diagrams for the mean and the range.
Figure 81: Average control diagram (X-bar). File Diagram_MR.


Figure 82: Range control diagram (R). File Diagram_MR.


The mean of the sample means is 74.001 (center line of the X-mean chart) and the average of the
sample ranges is 0.023 (center line of the R diagram). The values of d2 and d3 for n=5 are 2.326
and 0.864, respectively (Montgomery, 1991). The limits of the charts for means and ranges with
k=3 are:
74.001 ± 3·0.023/(2.326·sqrt(5))   and   0.023 ± 3·0.864·0.023/2.326

Since the R chart shows that the process variability is under control, the X-mean chart can be
interpreted; from it, there is no evidence of lack of control for the average quality level of
the process. These control limits could therefore be adopted for on-line process control. If the
R chart shows points out of control, the control limits should first be recalculated after
eliminating the assignable causes, and only then should interpretations be made from the X-mean
chart.
InfoStat also provides diagrams for the mean and the standard deviation. The X-mean and S charts
are used when the sample sizes are relatively large (n=10 or more observations per sample),
because in these cases the range (R) does not make efficient use of the information. S is
obtained in each sample of n observations as the square root of the unbiased variance estimator:
S = sqrt( Σ(i=1..n) (Xi - X̄)² / (n-1) )

Since the expected value of S is c4·σ and the standard deviation of S is σ·sqrt(1 - c4²), where
c4 = [2/(n-1)]^(1/2) · Γ(n/2)/Γ((n-1)/2) is a constant that depends on the sample size, the
control limits for the S chart when a standard value for σ is available are:
UCL = c4·σ + k·σ·sqrt(1 - c4²)
Central line = c4·σ
LCL = c4·σ - k·σ·sqrt(1 - c4²)
If σ is unknown, the limits are calculated from the set of m samples using the statistic S̄/c4,
which is an unbiased estimator of σ, with S̄ = (1/m)·Σ(i=1..m) Si and Si the sample standard
deviation of the i-th sample. Then the control limits are calculated as follows:
UCL = S̄ + k·(S̄/c4)·sqrt(1 - c4²)
Central line = S̄
LCL = S̄ - k·(S̄/c4)·sqrt(1 - c4²)
Control limits for the X-mean chart are defined on the basis of the estimated sample standard
deviation as:
UCL = μ̂ + k·S̄/(c4·sqrt(n))
Central line = μ̂
LCL = μ̂ - k·S̄/(c4·sqrt(n))

The on-line control of a process can be done by checking the Known control diagram parameters
field (subwindow Range control diagram (R)) and entering the values of the diagram parameters
obtained from preliminary samples: the process mean and the standard deviation (calculated as
σ̂ = R̄/d2).

For the Diagram_MR file, the X-mean and S diagrams were obtained by invoking the menu APPLICATIONS
QUALITY CONTROL VARIABLE CONTROL CHART and selecting Mean and standard deviation (X-mean, S). A
window called Standard deviation control diagram (S) appears, listing the variables of the file,
each representing a sample observation. We selected all the columns that contain sample
observations ("Obs1"-"Obs5") and included them in the Observations per sample subwindow.
Optionally you can specify a variable that contains dates or that indexes the time of extraction
of the sample in some way, to be used on the X axis of the diagram. By default the diagram assumes
that the data appear in the chronological order of sampling. After clicking OK, another screen
called Standard deviation control diagram (S) is displayed, in which the user can modify the value
of k. The figure below shows the standard deviation control diagram obtained for k=3 (the X-mean
chart is not shown here, but it is obtained automatically and interpreted in the same way as in
the previous example). The Output window lists the mean and S for each sample, the values of the
center line and control limits, and the process capability analysis. The results lead to the same
conclusions as in the previous example and are very similar, because the example involves a
sample size of n=5, for which the use of the range is appropriate.

[Figure: Standard deviation control diagram (S); x-axis: Number of sample, y-axis: Standard deviation.]
Figure 83: Standard deviation control diagram (S). File Diagram_MR.

Pareto diagram
The menu APPLICATIONS QUALITY CONTROL PARETO CHART allows obtaining a bar chart that shows the relative frequency of the different types of errors that can be detected in a quality control process. The special feature of this diagram is that the errors are sorted by frequency from highest to lowest. In many processes the number of errors of different types is recorded for each inspected unit, and their importance may differ. To build this diagram you need a file where the different types of errors are treated as variables (columns of the file) and each record corresponds to an inspection unit. The values entered in each cell can be 0, 1, 2, 3, ... and represent the number of errors of the corresponding type found in each unit.
Example 24: The Apareto file contains records of 6 types of errors made by two operators on a total of 64 pieces inspected.
To obtain the diagram, select the menu APPLICATIONS QUALITY CONTROL PARETO CHART and select the 6 columns containing the error counts as variables. If you also select the Operator variable as a classification criterion (optional), InfoStat reports what proportion of the total of each type of error was recorded by each operator. The chart below was obtained by selecting the 6 types of error with no classification by operator.


Figure 84: Pareto diagram. File Apareto

Process capacity
InfoStat reports on the process capacity (cp and cpk) by calculating the expected proportions of off-specification products. The user must provide the specification limits in the Specification limits (process capacity) field of the control diagrams subwindow, which appears after selecting the file columns involved in the analysis. The proportion of off-specification products is the sum of the expected proportions of products with measurements below the lower specification limit and above the upper limit. These proportions are estimated from the normal model, using the estimated mean and standard deviation of the process under control:

Lower: P(X < ELL) = Φ(Z_lower), with Z_lower = (ELL − μ̂)/σ̂
Upper: P(X > EUL) = 1 − Φ(Z_upper), with Z_upper = (EUL − μ̂)/σ̂
Total: P(X < ELL) + P(X > EUL)


where ELL and EUL are the lower and upper specification limits provided by the user (these are determined externally and are independent of the control limits). The values of the Z variable used to calculate these probabilities are known as bilateral tolerances.
Table 69 shows the expected proportions for the example of the Diagram_MR file, as reported by InfoStat in the Output window. The overall proportion of products outside the specified limits (73.97 and 74.03, since the rings require a diameter of 74.000 ± 0.03 mm) is 0.0020, so it is concluded that 0.20% of the forged rings have diameters different from those specified. InfoStat shows, under the name of bilateral tolerances, the values of the standard normal variable corresponding to the standardization of the lower and upper specification limits. Finally, InfoStat provides an assessment of process aptitude through the cp and cpk statistics. These indices assume that the natural tolerance limits of the process are near the specification limits. Since the amplitude between the natural limits equals 6 sigma, cp is calculated by dividing the amplitude EUL − ELL by 6 sigma. The capacity or aptitude ratio of the process, cp, is expected to be slightly larger than one if the process is under control. In addition, InfoStat reports cpk as the minimum absolute value of the bilateral tolerances divided by 3 (Montgomery, 1991).
Table 69: Process capacity analysis. File Diagram_MR.
P control diagram
Process capacity analysis
Item     Value
Mean:    74.00
S.D.:     0.01
LL:      73.97
UL:      74.03

Bilateral tolerances
Item      Value
Zlower:   -3.22
Zupper:    3.01

Proportions out of specification
Item    Value
Lower:  0.0006
Upper:  0.0013
Total:  0.0020

Process aptitude evaluation
Item  Value
cp    1.04
cpk   1.00
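The quantities of Table 69 can be reproduced approximately with the following sketch (Python/SciPy, outside InfoStat; the mean and standard deviation are rounded values taken from the table, so the printed results are only close to the reported ones):

from scipy.stats import norm

mean, sigma = 74.001, 0.0096        # approximate process mean and standard deviation
ELL, EUL = 73.97, 74.03             # specification limits entered by the user

z_lower = (ELL - mean) / sigma      # bilateral tolerances
z_upper = (EUL - mean) / sigma
p_lower = norm.cdf(z_lower)         # P(X < ELL)
p_upper = 1 - norm.cdf(z_upper)     # P(X > EUL)
cp = (EUL - ELL) / (6 * sigma)
cpk = min(abs(z_lower), abs(z_upper)) / 3

print(round(z_lower, 2), round(z_upper, 2))    # close to -3.22 and 3.01
print(round(p_lower + p_upper, 4))             # total out-of-specification proportion, close to 0.0020
print(round(cp, 2), round(cpk, 2))             # close to 1.04 and 1.00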

Teaching applications
Graphs of continuous density functions
The menu APPLICATIONS TEACHING TOOLS GRAPHS OF DENSITY FUNCTIONS produces graphs of density functions and displays the various shapes that these densities adopt as the parameters that characterize them are changed. It is also possible to shade areas under the curve defined by the density function and obtain their size, so probabilities associated with different events of interest can be obtained automatically. When the menu is accessed the following window appears:


In this window you must select the density of interest, which enables the fields for entering the parameters that characterize it. For details about the available distributions see, in the Data Management chapter of this manual, the DATA menu, FILL... submenu. The Event defined by values... subwindow is used to define the random event to which a probability is to be assigned. The event can also be defined after the graph has been generated, from the Graphics tools window.
As an example of the use of the plotter, suppose the Chi square density is chosen. When Chi square is activated, the field for entering the degrees of freedom (the only parameter characterizing this distribution) appears. After indicating the degrees of freedom of the distribution and pressing OK, the graph of the distribution appears, accompanied by the Graphics tools window. This window allows modifying attributes of the chart. The Series tab displays the fields for the distribution parameters used; in this case it shows the v field (the degrees of freedom with which the distribution was generated). If the value of this field is modified, the graph is automatically updated to show the distribution for the new parameter value entered from the Graphics tools window. The parameter can be modified by entering a new value in the field or by using the navigation ruler that automatically appears when you click on a field containing the parameter value.
To observe different distributions (graph series) on a single chart, the Clone button is provided. When there is a distribution plotted in the Graphics window, pressing the Clone button automatically creates a copy (clone) of the original graph series in the Series tab. Activating the new series (from the Series tab) and changing the value of its parameters makes it possible to view both distributions at once. This procedure can be repeated for as many series (density functions) as the user wants to display simultaneously in the same graphics window. Below, a graphics window shows three chi-square distributions with 3, 6 and 10 degrees of freedom, respectively. For this graph, a chi-square with 3 degrees of freedom (v) was requested, then it was cloned by pressing the Clone button and the value 6 was entered in the v field of the cloned series, and finally the last series was cloned again and the 6 was changed to 10 in the v field.


The Event subwindow allows defining a random event in order to display the area corresponding to its probability of occurrence. You must check the box Less than or equal to (<=), Greater than or equal to (>=) or Between, and enter the value(s) that define the event. For example, if the event corresponds to values of a standard normal random variable between -1.96 and 1.96, after selecting the normal density with mean=0 and variance=1, the Between option is activated in the Event subwindow and the values -1.96 and +1.96 are entered in the two fields that appear. A graph like the one presented below is obtained:


On the generated graph the probability of the event of interest can be read, in this case p(event)=0.9500. If the Supplementary box is checked, the probability of the complementary event can be read, that is, the event including the values of the variable smaller than -1.96 together with the values greater than +1.96. The shaded region then corresponds to the complementary event.
The Integral field, in the Graphics tools window for continuous density functions, allows visualizing the distribution function (cumulative distribution) of the density corresponding to the selected series. When this field is activated, the graphed density is replaced by its integral over the domain of values of the random variable under study.
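The probability shown on the graph can be checked outside InfoStat with a short calculation (SciPy, shown here only as an illustration):

from scipy.stats import norm

p_event = norm.cdf(1.96) - norm.cdf(-1.96)   # P(-1.96 <= Z <= 1.96) for a standard normal variable
p_complement = 1 - p_event                   # probability of the complementary event
print(round(p_event, 4), round(p_complement, 4))   # 0.95 and 0.05

# The Integral option corresponds to plotting the cumulative distribution, i.e. norm.cdf(x).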

Using the density function plotter for displaying hypothesis testing concepts
This application of InfoStat can be used to illustrate the hypothesis testing procedure. Suppose you have a density that depends on an unknown parameter and a random sample (X1, ..., Xn) of size n. The hypothesis testing problem, i.e. a problem for which there are only two possible actions (accept or reject the hypothesis), consists of deciding (based on the value of a statistic calculated from the available sample) whether the unknown parameter of the density, say θ, is equal to, lower than or greater than an arbitrarily specified value, say θ0 (it is assumed that the hypothesis refers to a parameter of the distribution).
For example, if a new treatment for a disease has been designed, the hypothesis could state that the average response of patients receiving the new treatment is greater than a value θ0 representing the average response of patients not receiving the new treatment (control group patients).
A simpler problem is to consider the situation where we want to test whether the distribution parameter is equal to a value θ0 or to a value θ1 (it is assumed that the parameter takes one of these two values). There are then two hypotheses:
H0: θ = θ0 and H1: θ = θ1


Once the experiment has been run, H0 is rejected or not depending on the observed value of the test statistic. The sample space for the hypothesis test is divided into two parts: R0 and R1. R0 denotes the region associated with the acceptance of H0, and R1 the remaining region, associated with the rejection of H0. This means that if a random sample produces a value of the test statistic belonging to R0 we will conclude that hypothesis H0 is correct. Otherwise we reject H0 and keep hypothesis H1.
Because the decision is taken on the basis of a random sample, it must be remembered that there is a chance of error. One error (type I error) can occur when, although H0 is true (the parameter assumes θ0 as its value in the population), the sample produces a value of the test statistic belonging to R1 and we therefore decide incorrectly in favor of H1. The other possible error (type II error) refers to situations where the true hypothesis is H1 but the sample produces a value of the test statistic belonging to R0, so H0 is not rejected and we thus incorrectly decide in favor of H0.
The probabilities of these improper actions are denoted by:
α = probability of rejecting H0 when it has been postulated correctly
β = probability of not rejecting H0 when hypothesis H1 is correct.
These error probabilities are known as the sizes of the errors: β is the size of the type II error and α is the size of the type I error. Since the actions are mutually exclusive, in a particular experiment only a type I error or a type II error can be committed.
The region R1 is traditionally known as the critical or rejection region for testing H0. The probability of obtaining a sample point belonging to the critical region when H0 is true, i.e. α, is also often called the size of the critical region.
The problem related to the null hypothesis is to find a critical region of size α that minimizes β. The Neyman-Pearson lemma provides a method for obtaining the critical region of size α that minimizes β with respect to all critical regions whose size does not exceed α (best critical region). The best test of the hypothesis is based on this best critical region. The problem then reduces to identifying the values that limit or define the region (critical values or critical points).
The probability 1 − β of rejecting a false null hypothesis is known as the power of the test. The power of the test decreases as β grows. InfoStat allows illustrating how β decreases as the sample size, the value of α, or the distance between the parameter values specified under H0 and H1 (null and alternative hypotheses) increases.
As a numerical illustration, consider the problem of testing whether the mean of a random variable in the population is 50 or 52, from the data of a random sample of size 25. Suppose we know that the random variable is normally distributed with variance σ²=100 and assume that the sample average, X̄, is 54. Then the hypotheses to be tested are H0: μ=50 and H1: μ=52, where the parameter μ represents the mean of the variable under study; in this example θ = μ.
The critical region R1 is defined by the values X̄ ≥ c, where c is chosen so that P(X̄ ≥ c | μ=50) = α. Taking α=0.05, the value c can be obtained in InfoStat as follows:


First, build the distribution of the statistic under the null hypothesis. This is a normal distribution with mean=50 and variance=4, since if X is normally distributed with mean=50 and variance=100, the statistic X̄ is normal with mean=50 and variance=100/25=4.
In Event defined by values..., activating the Greater than or equal to... option gives the critical point c, because InfoStat automatically reports the 0.95 quantile of the distribution when this option is activated. Then c=53.28 is the point that defines the regions R1 and R0. The Graphics window displays the distribution and the shaded area corresponding to the probability of the event.
If critical regions of a different size are required, the user must enter the corresponding critical value in the field that is enabled when the event is defined.
The critical point is the one that satisfies the equation:

Z = (c − 50) / √(100/25) = 1.645
The critical region corresponds to the sample points for which X̄ ≥ 53.28. Since the observed value of the sample mean is 54, it belongs to the rejection region and H0 should be rejected in favor of H1.
We now consider the numerical problem of calculating β, assuming μ0=50 and μ1=52, n=25, variance σ²=100 and α=0.05. Remember that β = P(X̄ ∈ R0 | H1), the probability associated with the event "the statistic belongs to the acceptance region given that the null hypothesis is false". Then β = P(X̄ ≤ 53.28 | μ=52). To get the value of β in InfoStat you can follow these steps.
On the graph obtained above, generate the distribution of the statistic under the alternative hypothesis, that is, graph a normal density with parameters mean=52 and variance=4. To achieve this, Clone the existing graph series and change the mean to 52, working from the Graphics tools window.
Under Event, activate the <= option and write 53.28 in the field. The shaded portion of this distribution corresponds to β. Below the title of the graph you can read the value of the probability of the type II error, p(event)=0.7405.
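The numbers of this example can be verified with the following sketch (SciPy, outside InfoStat); small differences with respect to the values reported above are due only to rounding of the critical point:

from math import sqrt
from scipy.stats import norm

mu0, mu1, sigma, n, alpha = 50, 52, 10, 25, 0.05
se = sigma / sqrt(n)                           # standard error of the sample mean (= 2)

c = norm.ppf(1 - alpha, loc=mu0, scale=se)     # critical point, about 53.29
beta = norm.cdf(c, loc=mu1, scale=se)          # P(X-bar <= c | mu = 52), about 0.74
print(round(c, 2), round(beta, 4), round(1 - beta, 4))   # critical point, type II error size, power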


The examples presented illustrate how InfoStat can be used to work on the geometric meaning of α and β. The same example could have been addressed using the distribution of the Z statistic. With other densities, the procedure for testing statistical hypotheses about the variance or about any other unknown parameter of a distribution could be illustrated.

Confidence intervals
The menu APPLICATIONS TEACHING TOOLS CONFIDENCE INTERVALS allows obtaining, by simulation, a set of confidence intervals for the expectation of a normal random variable. The purpose of this procedure is to show empirically the concept of confidence that underlies the interval estimation procedure.
The method of estimation by confidence intervals determines the precision of the estimate of the parameter of interest. The usual method is described here through the classic example of the estimation of the mean μ of a normal density with known variance σ², from a sample of size n.
In this case X̄ is the best unbiased estimator of μ, so the estimation will be based on X̄, which has a normal distribution with mean μ and variance σ²/n. This distribution allows calculating the probability of finding values of X̄ within a specific range around the parameter μ. It is also known that the variable Z = (X̄ − μ)/(σ/√n) has a standard normal distribution, so P(|Z| < 1.96) = 0.95. Then

P(−1.96 ≤ (X̄ − μ)/(σ/√n) ≤ 1.96) = 0.95

Rearranging we have:

P(X̄ − 1.96 σ/√n ≤ μ ≤ X̄ + 1.96 σ/√n) = 0.95

If this probability is interpreted operationally, in terms of the relative frequency of the indicated event over many repetitions of the sampling experiment, the confidence interval establishes that 95% of the intervals of the form obtained above, each one built from a sample of size n, will contain the mean μ.
The interval thus obtained is known as the 95% confidence interval for μ, and its minimum and maximum are known as the confidence limits for μ. InfoStat allows illustrating the operational concept of probability underlying interval estimation because, from the distribution of the variable X defined by the user, it simulates a number H of samples and calculates for each sample the confidence interval limits described above. The H intervals obtained are arranged in a graph for comparison. On the same graph InfoStat traces, on the vertical axis, the value of the parameter (Mean). Clearly, in practice only one sample is available and the mean of the distribution is unknown; it is precisely this parameter that we want to estimate. However, in this application the underlying distribution is assumed known (the mean, variance and sample size are entered by the user). The graph that InfoStat shows is then constructed from repeated samples taken from this distribution (simulation of the sampling process).
The user can also indicate the confidence level with which to work. When the menu is accessed the following window appears:

Filling the required information and pressing OK will generate a graph with its
corresponding graphical tools window such as:


The graph shows the center of each interval indicated by a point and, below the title, the percentage of intervals containing the true parameter value. Note that there are 4 intervals with a different pattern; they are the ones that do not contain the parameter. In the Graphics tools window you can change the number of intervals to display (Intervals), the confidence level with which each interval is constructed (Confidence), the sample size from which each interval is calculated (Sample size) and the size of the dots indicating the center of each interval (Size). The Draw contours button allows joining the intervals obtained through their boundaries. Pressing this button draws the boundaries and activates the Color option to paint the inside of the generated contours. The Draw contours and Color buttons change their names once activated, allowing the requested actions to be undone (Clear contours and Discolor, respectively).
The Build another set of intervals button generates another set of confidence intervals for the mean from the simulation of a new set of random samples, 100 samples in this example.
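A simulation analogous to the one performed by this module can be sketched as follows (NumPy, for illustration only; the parameters mimic the example: 100 intervals of 95% confidence built from samples of a known normal population):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n_intervals, z = 100, 10, 25, 100, 1.96

covered = 0
for _ in range(n_intervals):
    sample = rng.normal(mu, sigma, n)
    half = z * sigma / np.sqrt(n)                  # half-width of the known-variance interval
    covered += (sample.mean() - half <= mu <= sample.mean() + half)

print(covered, "of", n_intervals, "intervals contain the true mean")   # close to 95 on average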

All possible samples


The menu APPLICATIONS TEACHING TOOLS ALL SAMPLES allows obtaining all possible samples from a set of observations of a particular characteristic, together with a set of sample statistics for each of the generated samples. This module can be used for educational purposes to visualize the sampling distribution of means, variances and proportions.
InfoStat requires the selection of a variable from a data table, which can be integer or real. The values read are presented on the sample generator screen. The user must enter the size of the samples to generate, and InfoStat will generate all possible samples of that size from the data set. The user can select simple random or systematic sampling (see Statistics, Estimation of population characteristics). For each sample the following can be requested: the identification of the elements that compose it, the values of each observation, and the sample statistics total, mean, proportion, variance and corrected variance (corrected for finite population, that is, multiplied by (N-1)/N).
When the All possible samples submenu is invoked, the variable selector appears, in which you indicate the column of the table containing the values of the variable in the population under study and, if you want to work with partitions of the file, the column(s) that define the partitions. After clicking OK, a window appears that displays the size of the population; there you indicate the sample size and obtain the number of possible samples. You must select the type of sampling: simple random (srs) or systematic. The bottom of the window shows the population mean and the population variance of the values of the specified variable. Under Select the result you can choose: Index of sampled values, Values observed in the sample, Sample mean, Estimated population total, Sample variance, Corrected sample variance or Proportion of successes. After setting the options, press the Get samples button. The values appear in a new data file, which by default InfoStat will name Sample mean (if the default Select the result option is kept), or Variance, Proportion, etc., depending on the statistic requested.
To illustrate the use of this application, consider the following procedure, which produces the set of sample means from a population of 30 observations and visualizes the distribution of the means:


Create a new table containing 30 rows.

Fill the cells with the values of a random variable. In this example the normal distribution with expectation and variance equal to 100 is used to generate the starting population. This is accomplished by invoking the Data menu, Fill with... submenu, Other option, and entering the normal distribution parameter values in the dialog box, in this case mean = 100 and variance = 100.

The 30 records obtained represent the population from which all possible samples of a given size will be obtained. In this example, all samples were of size 3. This is accomplished via APPLICATIONS TEACHING TOOLS ALL SAMPLES; in the All possible samples window, in the Values in the population field, select the name of the column that contains the newly generated population data. Then, in the subsequent window, enter the information in the appropriate fields. In this case a sample size of 3 was entered, Sample mean was checked to obtain the average of each generated sample, and the Get samples button was pressed. A file called Sample mean, with 4060 records, is generated.

To visualize the distribution of the sample means obtained, open the previously saved file and go to the menu GRAPHICS HISTOGRAM. A chart like the one shown below is automatically displayed:


Figure 85: Histogram and frequency polygon for the variable sample mean for the
population of all samples of size 3, obtained by random sampling without replacement from
a population of 30 observations with normal distribution with mean 100 and variance 100.
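The same exercise can be reproduced outside InfoStat with a few lines (Python, illustrative only; the population is simulated, so the numerical values will differ from run to run):

from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(100, 10, 30)     # N(100, 100): variance 100 means standard deviation 10

sample_means = np.array([np.mean(s) for s in combinations(population, 3)])
print(len(sample_means))                                              # C(30, 3) = 4060 samples
print(round(sample_means.mean(), 2), round(population.mean(), 2))     # mean of the sample means = population mean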

Sampling from the empirical distribution


The menu APPLICATIONS TEACHING TOOLS SAMPLES FROM THE EMPIRICAL DISTRIBUTION allows obtaining samples from the empirical distribution of a dataset. The user must specify the size of the sample to be extracted, as well as the number of samples required. These samples can be stored in a data table in multiple columns (Samples in different columns) or in a single column (Samples in one column). This module can be used for educational purposes to visualize the sampling distribution of statistics derived from the empirical distribution.
Using the Atriplex file, the variable Germination was chosen and 2 samples of size 100 each were requested, first activating the option Samples in different columns and then Samples in one column. The files generated in each case are as follows:
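Sampling from the empirical distribution amounts to drawing values with replacement from the observed data, as in this sketch (NumPy; the germination vector below is a placeholder, not the actual Atriplex data):

import numpy as np

rng = np.random.default_rng(2)
germination = rng.integers(0, 101, size=30)          # placeholder for the Germination column

# two samples of size 100, drawn with replacement from the observed values
samples = rng.choice(germination, size=(2, 100), replace=True)
print(samples.shape)                                 # (2, 100): one row per requested sample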

Resampling
The increasing importance of resampling methods (sampling from a sample) arises in situations where the distributions of the random variables under study are unknown or intractable. In inferential procedures it is common to need both an estimate of the distributional parameter of interest and the standard error of that estimate, so general methods of estimation become important. Important candidates for obtaining these values come from the idea of resampling. The jackknife (Quenouille, 1949) and the bootstrap (Efron, 1979) have proven to be powerful methods in many situations because they provide an empirical simulation of the random component associated with the statistic used in the estimation. The random nature of the generation of the samples allows building an empirical estimate of the sampling distribution of the statistic considered. These methods cannot be applied indiscriminately; it is necessary to evaluate their usefulness in each particular case.
Given the growing popularity of these methods in statistical studies, these basic resampling procedures have been implemented in InfoStat to facilitate the teaching of computationally intensive estimation methods. In addition to performing a jackknife or bootstrap resampling from an original sample, the user can store the samples and calculate basic statistics on them.


Thus, as an exercise, it is possible to understand that for a population characteristic θ and a single sample of size n, (X1, X2, ..., Xn), the standard error of a sample statistic θ̂ that estimates θ can itself be estimated.
The steps to estimate θ using the traditional jackknife procedure are as follows: (1) divide the sample into n subsamples of size 1; (2) separate one subsample from the original sample (leave one observation out); (3) calculate the estimate θ̂(i) from the reduced sample of size (n-1), i.e. the estimate after removing the i-th observation, i=1, 2, ..., n; (4) repeat steps 2 and 3 for all possible subsamples, i.e. n times. These steps are done automatically in InfoStat by selecting the Jackknife option in the Resampling menu. The n estimates obtained can be used to obtain a point estimate of θ and its standard error by selecting the appropriate summary measures. For example, a point estimate can simply be the jackknife average of the n estimates:

θ̂(·) = (1/n) Σᵢ θ̂(i)

This average can also be used to develop a new point estimator, call it θ̂J, through the calculation θ̂J = n θ̂ − (n − 1) θ̂(·).

The jackknife estimator of the standard error is the square root of the sample variance calculated as:

V(θ̂)J = ((n − 1)/n) Σᵢ (θ̂(i) − θ̂(·))²
The coefficient (n − 1)/n in the jackknife variance equation is arbitrary. The rationale for its use is that the variation between jackknife samples of size n − 1 is expected to be small, because the jackknife samples are n fixed data sets obtained from the original sample. The generic bootstrap procedure for the case of a single sample consists of extracting a number of bootstrap samples that is not fixed (as it is in the jackknife). A bootstrap sample is a set of n observations extracted from the original sample of size n by random sampling with replacement. In a sample obtained by sampling with replacement, some elements of the original sample may appear more than once and others may not appear at all. After automatically obtaining many bootstrap samples, say B samples, and calculating the statistic of interest from each one, the user has available a set of B estimates from which θ and its sampling variance can be estimated empirically. The bootstrap point estimator of the parameter of interest is the bootstrap average θ̄*B. The sampling variance of θ̂ obtained by bootstrap is the sampling variance of the bootstrap estimates:

V(θ̂)B = (1/(B − 1)) Σᵢ (θ̂*(i) − θ̄*B)²
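The two procedures can be sketched in a few lines (NumPy, shown only to make the formulas concrete; the data vector is simulated, not the actual Atriplex sample):

import numpy as np

def jackknife_se(x, stat=np.mean):
    n = len(x)
    theta_i = np.array([stat(np.delete(x, i)) for i in range(n)])   # leave-one-out estimates
    theta_dot = theta_i.mean()                                      # jackknife average
    se = np.sqrt((n - 1) / n * np.sum((theta_i - theta_dot) ** 2))
    return theta_dot, se

def bootstrap_se(x, stat=np.mean, B=250, seed=0):
    rng = np.random.default_rng(seed)
    theta_b = np.array([stat(rng.choice(x, size=len(x), replace=True)) for _ in range(B)])
    return theta_b.mean(), theta_b.std(ddof=1)                      # bootstrap average and standard error

x = np.random.default_rng(3).integers(0, 101, 30)    # simulated germination percentages
print(jackknife_se(x), bootstrap_se(x))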

As an application example, the results obtained by resampling a set of 30 observations of the germination percentage of Atriplex sp. seeds are shown below. Point estimates of the average germination percentage are desired.
Using the menu APPLICATIONS TEACHING TOOLS BOOTSTRAP, Jackknife is requested as the Type of sampling option, and the sample means are requested as output (checked by default in the Save panel). After clicking OK, a table with the 30 sample means is generated. The procedure was then repeated using the bootstrap resampling technique to obtain 250 bootstrap sample means. The generated tables are:

With the original sample and with each of the generated tables, the average value of each data set was calculated. That is, activating the corresponding table, the menu STATISTICS SUMMARY STATISTICS is requested and the required information is completed. In this example the average germination percentage of the original variable is 65.20, the average of the means obtained by jackknife resampling is 65.20, and the mean obtained by bootstrap resampling is 65.26.

Indexes
Biodiversity indexes
This module of InfoStat has been designed and developed with the advice of Dr. Laura Pla
(Universidad Nacional Experimental Francisco de Miranda, Venezuela).
Through this submenu, InfoStat allows obtaining a variety of biodiversity indices. These indices, used as a heuristic approach to analyze plant and animal communities, are widely applied in studies of ecology, landscape, genetic diversity, environmental risk and changes in land-use patterns.
In biodiversity studies based on sampling from communities, the sample size (number of observation units) may be too small for parametric inference on diversity. Nevertheless, it is desirable to obtain estimates with known confidence levels. An alternative to parametric estimation for diversity indices is the construction of confidence intervals (CI) using computationally intensive techniques such as the bootstrap.
In InfoStat the following biodiversity indices can be applied: richness (direct count), Chao richness, Shannon-Weaver index, Simpson index, McIntosh index, Berger-Parker index, Bulla index and Kempton index. The software also allows applying transformations to calculate expressions derived from these indices: identity (I), reciprocal (1/I), complement (1-I) and weighted (I/ln(richness)). For example, based on measures of dominance such as the Shannon index, the user can obtain a measure of evenness using the weighted transformation.
CIs are obtained by three algorithms: 1) standard bootstrap, based on the normal approximation, 2) bootstrap percentiles and 3) bias-corrected and accelerated bootstrap (BCA). Even if only a single sample is available to estimate biodiversity, InfoStat builds a CI for the selected index by means of the methodological strategy proposed by Pla (2003).
For studies of regional diversity, InfoStat allows incorporating one or more variables for the classification of the samples. In these cases, estimates and CIs of the indices are obtained for the different levels of the hierarchical structure defined by the classification criteria and for the total sample.
The quantitative analysis of diversity through the calculation of indices can be done in two basic situations:
a) when there is a single sample per community
b) when there are two or more samples per community

a) Biodiversity based on a single sample


In the case of a single sample, the data to be processed by InfoStat are assumed to be a measure of abundance (absolute or relative) of each species in the sample. The data table must contain a column for each species and a single row with the abundances of each species in the sample.

b) Biodiversity based on two or more samples


In the case of multiple samples per community, the data to be processed by InfoStat are assumed to be a measure of abundance (absolute or relative) of each species (variables or columns of the data table) in each sample or case (a transect, a subtransect, a cell of a grid). Additionally, the data table can contain one or more columns with classification variables, such as community, locality or region. Whenever the data table contains several cases that are subunits of the same sample, a column identifying each sample must be included as a classification criterion.
In the menu APPLICATIONS OTHER INDICES OF DIVERSITY, the Measures of diversity window appears, where the user must indicate in the Variables panel the species, or the units that make up biodiversity, and in Criteria the variables that differentiate the samples in the file (e.g. transect, country, region). InfoStat will calculate biodiversity indices and confidence intervals for each sample, for each level of the classification variables and for the total. If there is more than one classification criterion, the first variable should correspond to the main hierarchy and each subsequent variable is nested in its predecessor. InfoStat accepts more than two classification variables.
After clicking OK, a window is displayed where you must select the indices to be estimated. In the Bootstrap intervals panel you select the bootstrap estimation method with which the confidence intervals will be built. In the same dialog box you indicate the confidence coefficient for the CIs to be obtained (Confidence %), the number of bootstrap samples with which the estimates are made, and one of the following transformations of the index: identity (I), reciprocal (1/I), complement (1-I) or weighted (I/ln(r), where r is the richness).
If no classification variables are specified, InfoStat calculates the indices and CIs for each case or sample. If classification variables are specified and CIs are also desired for each case, the Bootstrap estimates for cases box must be activated and the corresponding variable incorporated in the Classification criteria subwindow.
If Totals for each level is checked, InfoStat adds to the Results window the total occurrences of each level of the classification variables for which the indices were calculated.

Brief description of the methods used to quantify biodiversity


Assume a population with a total of S non-overlapping classes (typically species in biodiversity studies), identified by i = 1, 2, ..., S. We denote by πi the proportion of the i-th class. Because the classes are mutually exclusive, meaning that a single item or individual cannot belong to more than one class, the πi are subject to the restriction Σi πi = 1.
Suppose a random sample is taken from this population; we call Xi the number of individuals or abundance of class i; if Xi = 0, the i-th species has not been observed in the sample. The total number of species observed in the sample is denoted r, which can never be greater than S. We denote by fk the number of classes with frequency or abundance k, and thus the total number of individuals or total abundance (to) in the sample can be calculated as:


to = Σi Xi

which can be expressed in terms of fk as to = Σk k·fk. Furthermore, the πi can be estimated from a sample by the relative frequencies

pi = Xi / to

The richness estimates are based on the frequencies of singletons and doubletons (f1 and f2) and on the number of classes or species actually observed (r). The estimates of the biodiversity indices are based on the pi, the relative frequencies of each species.

Richness
InfoStat can calculate the richness observed in the sample (r), which is the total number of species present in the sample. This estimate is always a value no greater than the true richness of the community.
Chao richness: Chao (1987) derived an estimator of the total number of species present in a community as:

Ŝ = r + f1² / (2 f2)

f1 being the number of species with abundance one and f2 the number of species with abundance two. The index can only be calculated when there are species with abundance 2 in the sample.
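A direct translation of these two richness estimates (observed and Chao) into code, using a hypothetical abundance vector, could look as follows (NumPy, illustrative only, not InfoStat output):

import numpy as np

abundances = np.array([12, 5, 3, 2, 2, 1, 1, 1, 0])   # one value per species (columns of the data table)
observed = abundances[abundances > 0]

r = len(observed)                     # observed richness
f1 = np.sum(observed == 1)            # singletons
f2 = np.sum(observed == 2)            # doubletons
S_chao = r + f1**2 / (2 * f2)         # defined only when f2 > 0
print(r, S_chao)                      # 8 and 8 + 9/4 = 10.25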

Shannon Index
The Shannon biodiversity index (Shannon and Weaver, 1949) is based on the assumption that heterogeneity depends on the number of species present and on their relative abundances. It is a measure of the degree of uncertainty associated with the random selection of an individual in the community. It is calculated as:

H = − Σi pi ln pi

Maximum diversity (H = ln r) is achieved when all species are equally present. An evenness index associated with this measure of diversity can be calculated as the ratio H/Hmax = H/ln r.
When there are several observations per sample, classification criteria or hierarchical structures, construction of the CI by the BCA method is recommended (Pla, 2001). For samples whose total frequency or abundance does not exceed 500, it is recommended to use the adjusted method for estimating the Shannon index:

Ha = 2.73 H − 1.75 H* + 0.0003 r*


where H is the index calculated with the equation above, H* is the bootstrap estimate of the index and r* the bootstrap estimate of the richness. The construction of the CI is based on the bootstrap standard deviation, and the limits are calculated as:

Ha ± z(α/2) s*r

where s*r is the bootstrap standard deviation of the richness (Pla, 2004). For total frequencies or abundances in the sample greater than 500, the direct use of BCA CIs is recommended.

Simpson Index
Proposed by Simpson (1949), it suggests that an intuitive measure of the diversity of a population is given by the probability that two individuals taken independently from the population belong to the same species.
The estimator of the Simpson index is calculated as:

D = Σi xi (xi − 1) / [to (to − 1)]

The index varies between 1/r (minimum concentration, i.e. maximum possible diversity with r species) and one (maximum concentration or minimum dispersion, when a single species dominates the community). The reciprocal of the Simpson index (1/D) can be interpreted as the number of equally abundant species needed to produce the observed heterogeneity of the sample. The construction of confidence intervals is not recommended when D is less than 0.02 or its reciprocal is greater than 50 (Pla, 2003).
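The Shannon and Simpson estimators defined above can be computed directly from an abundance vector, as in this sketch (NumPy, with the same hypothetical data as in the richness example):

import numpy as np

x = np.array([12, 5, 3, 2, 2, 1, 1, 1], dtype=float)
to = x.sum()
p = x / to

H = -np.sum(p * np.log(p))                     # Shannon index
E = H / np.log(len(x))                         # evenness H / ln(r)
D = np.sum(x * (x - 1)) / (to * (to - 1))      # Simpson index (probability of an intraspecific encounter)
print(round(H, 3), round(E, 3), round(D, 3), round(1 / D, 2))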

McIntosh index
The McIntosh index (1967) is a dominance index based on considering the community as a point in a hyperspace defined by the species, which can be quantified by the Euclidean distance from that point to the origin, √(Σi xi²). The highest diversity occurs when there are as many species as individuals, and the difference between the maximum value of the distance (to) and the distance of the community under study is a measure of absolute diversity (the numerator of the index).
The estimator of the index is calculated as:

U = (to − √(Σi xi²)) / (to − √to)

and can be interpreted as the ratio between the absolute diversity and its maximum possible value (the denominator). It varies between zero (minimum diversity, when there is a single species) and one (when diversity is maximal and all species have a frequency of one).


Bulla Index
In a graph of the relative frequency of each species (ordinate) versus the species (abscissa), a horizontal line at 1/r represents a community with maximum diversity. If a line representing the relative frequencies actually observed in the community is superimposed, and the degree of overlap between the two distributions is calculated as

O = Σi min(pi, 1/r)

the measure of evenness proposed by Bulla (1994) is obtained. The index is then adjusted as

(O − 1/r) / (1 − 1/r)

so that it varies between zero, when one species shows absolute dominance, and one, when all species are equally present.

Berger-Parker Index
Originally proposed for phytoplankton populations (Berger and Parker, 1970), this index takes into account only the most abundant species and is the simplest of the biodiversity indices. It is calculated as:

d = xmax / to
Kempton Index
Kempton and Taylor (1976) proposed an index that avoids giving excessive weight to the most abundant and to the least abundant species in the calculation of diversity. The index is the slope, between the first quartile (p=0.25) and the third quartile (p=0.75), of the cumulative lognormal distribution of species abundances, with abundance in descending order on the abscissa and the cumulative number of species on the ordinate. InfoStat calculates it as:
Q = [ ½ f(k0.25r) + Σ f(k) + ½ f(k0.75r) ] / log(k0.75r / k0.25r)

where k0.25r and k0.75r are the abundances corresponding to the first and third quartiles of the species ranked by abundance, fk is the number of species with abundance k, and the sum runs over the abundance classes strictly between k0.25r and k0.75r.

The index tends to zero when the first and third quartiles coincide, meaning that the species at the 'center' of the distribution contribute little to the accumulated richness. The more homogeneous the distribution of species abundances, the greater the Kempton index. CIs calculated by any of the bootstrap methods do not behave well when only a single sample is available, and there is no other method to calculate the standard deviation when the probabilistic model of the underlying distribution is unknown.


Bootstrap intervals calculated by InfoStat


A confidence interval of level α is defined as a set of values of the parameter (an interval) that, with confidence (1-α)100%, will include the parameter of the population, given the variability and the sampling distribution of the estimator in the observed sample. Thus, confidence intervals are usually constructed from parametric assumptions about the shape of the sampling distribution of the estimator (normal, Student's t, chi-square, etc.). In the case of biodiversity indices it is not reasonable to assume a sampling distribution for the estimator, and therefore InfoStat provides an interval construction technique based on the nonparametric resampling procedure known as bootstrap.
The bootstrap technique consists of drawing, by random sampling with replacement, B samples of size n from the original sample of size n. In each of the B bootstrap samples (by default B=500) InfoStat calculates the statistic of interest (in this case a biodiversity index).
When a classification variable is selected and the biodiversity index is calculated for a set of rows, n is the total number of rows used for that estimate. InfoStat totals by species and calculates the index in each bootstrap sample.
To apply resampling methods to calculate the index for a single sample, InfoStat assumes that the population from which the sample comes is homogeneous and divided into 'unit portions', each as large as the minimum recognizable size, expressed as a frequency. Thus, the abundance xi of each species is divided into unit portions that can be sampled. As the abundance is expressed on the same scale for all species, both in the population and in the samples (and resamples), a random selection of 'unit portions', each characterized by the species recorded in it, can be made. These are N virtual objects, which are assumed to be mutually independent and to have the same (unknown) probability distribution. This value of N is what InfoStat takes in these cases as the 'sample size'.

Percentile estimation
The B estimates can be sorted in ascending order and the quantiles to be used as limits of the bootstrap confidence interval of the parameter of interest can be identified. Thus, when the percentile estimate is selected, the limits of the bilateral (1-α)100% confidence interval correspond to the (α/2)100 and (1-α/2)100 percentiles of the list of estimates from the B bootstrap samples drawn from the original sample.
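A percentile bootstrap interval for an index can be sketched as follows (NumPy; the abundance vector is hypothetical and the resampling of 'unit portions' follows the description given above):

import numpy as np

def shannon(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(4)
x = np.array([12, 5, 3, 2, 2, 1, 1, 1])                 # hypothetical abundances, one value per species
units = np.repeat(np.arange(len(x)), x)                 # one 'unit portion' per individual, labelled by species

B, alpha = 500, 0.05
boot = np.empty(B)
for b in range(B):
    resample = rng.choice(units, size=len(units), replace=True)
    counts = np.bincount(resample, minlength=len(x))    # abundances in the bootstrap sample
    boot[b] = shannon(counts)

lower, upper = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
print(round(lower, 3), round(upper, 3))                 # percentile bootstrap interval for the Shannon index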

Standard interval by normal approximation


It is based on the assumption that the bootstrap estimator I* is approximately normally distributed with mean I and standard deviation σI, and therefore there is a probability (1-α) that

I + z(α/2) σI < I* < I + z(1-α/2) σI

for every random sample used in the estimation (Efron, 1979). On this basis it is possible to estimate the confidence interval limits as

LL = I* + z(α/2) s*I
UL = I* + z(1-α/2) s*I

where I* is the bootstrap estimator of the index, s*I its bootstrap estimate of the standard deviation, and z(α/2) and z(1-α/2) the corresponding values of the standard normal distribution.

Corrected interval by bias and acceleration (BCA)


To apply this method the following are required:
the bootstrap distribution of the estimator I*;
an estimate of the median bias of the estimation, i.e. the difference between the mean (or median) of the sampling distribution of I and the parameter of interest;
an estimate of the acceleration of the variance, that is, of how the sampling variance increases or decreases as the parameter increases;
the value of α, which determines the confidence level of the interval, (1-2α)100%.

The calculation of the acceleration (a) is carried out as:

a = Σi (Ī*(·) − I*(i))³ / { 6 [ Σi (Ī*(·) − I*(i))² ]^(3/2) }

where I*(i) is the estimator of the parameter in the reduced sample in which the i-th observation has been omitted, Ī*(·) is the mean of the estimates calculated with the reduced samples, and the sums run over the n reduced samples.
To calculate the limits, the corrected percentile method is used (Efron, 1981). This technique removes the bias that results from the failure of the median of the bootstrap distribution to estimate the distribution of the estimator of the parameter of interest. It can be applied by using the bootstrap samples to represent the true bootstrap distribution and assuming that there is a monotonic transformation of the estimator that is normally distributed with mean zero and variance one.
Suppose that, out of 1000 bootstrap estimates, 550 are larger than the estimate from the original sample. The ratio 550/1000 = 0.55 (called p*) then replaces the percentile of the median. Knowing this ratio improves the estimation of the confidence interval limits and recovers the possible asymmetry of the distribution.
In the calculation of the CI, the values z(α/2), z(1-α/2) and zp* of the standard normal distribution are used, corresponding to the probabilities α/2, (1-α/2) and p*, respectively. If p* is 0.50 then zp* = 0 and a symmetric interval is obtained.


The confidence interval combining the bias correction and the acceleration corresponds to the quantiles of the empirical bootstrap distribution that match:

qLL = zp* − (z(1-α/2) − zp*) / (1 + a (z(1-α/2) − zp*))

for the lower limit, and

qUL = zp* + (zp* − z(α/2)) / (1 − a (zp* − z(α/2)))

for the upper limit.

S-ar putea să vă placă și