
Session 1-2: Introduction to R

Why Analytics?
Business Intelligence and Analytics (BI&A) and the related field of Big Data Analytics have become
increasingly important to the business community over the past two decades. The amount of data
available in the world has exploded. Experts believe that by the end of 2015 the total amount of data
generated will reach 7.9 zettabytes, where 1 zettabyte is equivalent to 1 trillion gigabytes. Analytics
is used today, and will be used even more tomorrow, in applications involving large volumes of data,
such as life science research, analysis of consumer behavior, analysis of social media usage, weather
forecasting, and accurate healthcare services. It has therefore become very important to understand,
infer, and learn from this enormous volume of data quickly, often in real time, in today's
information-driven business world. Naturally, a large number of analytics tools and software packages
have evolved over the last few years to handle these challenges. Among them, the programming
language R is rapidly becoming the de facto standard among professionals in the analytics industry,
because of its extremely rich set of libraries and because it is an open-source tool.
Why R?
R is a flexible and powerful open-source implementation of the language S (for statistics), which was
developed by John Chambers and others at Bell Labs. R has eclipsed S and the commercially available
S-PLUS program for many reasons. R was created by Ross Ihaka and Robert Gentleman in the Department
of Statistics at the University of Auckland. The first announcement of R was made to the public in
1993. Ross's and Robert's experience developing R is documented in a 1996 paper in the Journal of
Computational and Graphical Statistics:
Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics, 5(3):299-314, 1996.
In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the GNU
General Public License to make R free software. This was critical because it allowed the source code
for the entire R system to be accessible to anyone who wanted to tinker with it (more on free software
later).
In 1996, public mailing lists were created (the R-help and R-devel lists), and in 1997 the R Core Group
was formed, containing some people associated with S and S-PLUS. Currently, the core group controls
the source code for R and is solely able to check in changes to the main R source tree. Finally, in 2000,
R version 1.0.0 was released to the public.
R is free, and has a variety (nearly 4,000 at last count) of contributed packages, most of which are
also free. R works on Macs, PCs, and Linux systems. In this book, you will see screens of R 3.2.3
running in a Windows 7 environment, but you will be able to use everything you learn with other
systems, too. Although R is initially harder to learn and use than a spreadsheet or a dedicated
statistics package, you will find that R is a very effective statistics tool in its own right, and is well
worth the effort to learn.
Here are five compelling reasons to learn and use R.

1. R is open source and completely free. It is the de facto standard and preferred program of many
professional statisticians and researchers in a variety of fields. R community members regularly
contribute packages to increase R's functionality.
2. R is as good as (often better than) commercially available statistical packages like SPSS, SAS,
and Minitab.
3. R has extensive statistical and graphing capabilities. R provides hundreds of built-in statistical
functions as well as its own built-in programming language.
4. R is used in teaching and performing computational statistics. It is the language of choice for
many academics who teach computational statistics.
5. Getting help from the R user community is easy. There are readily available online tutorials, data
sets, and discussion forums about R.

R combines aspects of functional and object-oriented programming. One of the hallmarks of R is implicit
looping, which yields compact, simple code and frequently leads to faster execution. R is more than a
computing language. It is a software system. It is a command-line interpreted statistical computing
environment, with its own built-in scripting language. Most users mean both the language and the
computing environment when they say they are using R. You can use R in interactive mode, which we
will consider in this introductory text, and in batch mode, which can automate production jobs. We will
not discuss the batch mode in this book. Because we are using an interpreted language rather than a
compiled one, finding and fixing your mistakes is typically much easier in R than in many other
languages.
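To make the idea of implicit looping concrete, here is a minimal sketch. The vector prices and its values are made up for illustration; the point is only that a whole-vector expression can stand in for an explicit loop.

```r
# Hypothetical data, invented for this illustration
prices <- c(10, 20, 30, 40)

# Explicit loop: add 18% to each element, one at a time
with_tax_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_tax_loop[i] <- prices[i] * 1.18
}

# Implicit looping: the same operation applied to the whole vector at once
with_tax_vec <- prices * 1.18

# sapply() also loops implicitly, applying a function to each element
with_tax_sapply <- sapply(prices, function(p) p * 1.18)

all.equal(with_tax_loop, with_tax_vec)     # TRUE
all.equal(with_tax_vec, with_tax_sapply)   # TRUE
```

The vectorized form is a single line and contains no index bookkeeping, which is what makes such code compact.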

Limitations of R
No programming language or statistical analysis system is perfect. R certainly has a number of
drawbacks. For starters, R is essentially based on almost 50-year-old technology, going back to the
original S system developed at Bell Labs. There was originally little built-in support for dynamic or 3-D
graphics (but things have improved greatly since the old days).
Another commonly cited limitation of R is that objects must generally be stored in physical memory. This
is in part due to the scoping rules of the language, but R generally is more of a memory hog than other
statistical packages. However, there have been a number of advancements to deal with this, both in the R
core and also in a number of packages developed by contributors. Also, computing power and capacity
have continued to grow over time, and the amount of physical memory that can be installed on even a
consumer-level laptop is substantial. While we will likely never have enough physical memory on a
computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a
bit easier over time.
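As a rough illustration of objects living in physical memory, the sketch below measures one vector with object.size(). The vector x is invented for the example, and the exact byte count can vary slightly between R versions.

```r
# A numeric vector of one million values, held entirely in memory
x <- numeric(1e6)

# Each double takes 8 bytes, so this is roughly 8 million bytes plus a small header
object.size(x)

# The same size in a friendlier unit
print(object.size(x), units = "Mb")

# Remove the object and ask R to run garbage collection to release the memory
rm(x)
invisible(gc())
```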
At a higher level, one limitation of R is that its functionality is based on consumer demand and
(voluntary) user contributions. If no one feels like implementing your favorite method, then it's your job
to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect
the interests of the R user community. As the community has ballooned in size over the past 10 years, the
capabilities have similarly increased. When I first started using R, there was very little in the way of
functionality for the physical sciences (physics, astronomy, etc.). However, now some of those
communities have adopted R and we are seeing more code being written for those kinds of applications.

Getting started and using R

If you do not already have R running on your system, download the precompiled binary files for your
operating system from the Comprehensive R Archive Network (CRAN) web site, or preferably, from a
mirror site close to you. Here is the CRAN web site: http://cran.r-project.org/
Download the binary files and follow the installation instructions, accepting all defaults. Launch R by
clicking on the R icon. For other systems, open a terminal window and type R on the command line.
When you launch R, you will get a screen that looks something like the following. You will see the label
R Console, and this window will be in the RGui (graphical user interface).
Official Manuals
An Introduction to R
R Data Import/Export
Writing R Extensions: discusses how to write and organize R packages
R Installation and Administration: mostly for building R from the source code
R Internals: describes the low-level structure of R and is primarily for developers and R
core members
R Language Definition: documents the R language and, again, is primarily for developers

Fig.1: My first R window


Let our journey begin.

Open RGui and try the following.


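Note: R is very sensitive to exact typing (case, quotation marks, and punctuation all matter). It is
therefore best to type commands at the R prompt and then copy them into a document file. Copying in the
reverse direction, from a document into R, may lead to errors, for example when a word processor has
replaced straight quotes with curly ones.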

1. Type in the following on R prompt:


3 + 9 + 12 -7
12 + 17/2 -3/4 * 2.5
(12 + 17/2 -3/4) * 2.5
pi * 2^3 - sqrt(4)
abs(12-17*2/3-9)
factorial(4)
log(2, 10)
log(2, base = 10)
log10(2)
log(2)   # natural log
exp(0.6931472)
log10(2)
10^0.30103
sin(45 * pi / 180)
asin(0.7071068) * 180 / pi
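The expressions in step 1 can be checked against each other. The sketch below is only an illustration of operator precedence and of the inverse relationships between the functions used; all.equal() is used for the comparisons because floating-point results are rarely exactly equal.

```r
# Without parentheses, / and * are evaluated before + and -
12 + 17/2 - 3/4 * 2.5        # 18.625

# Parentheses force the sum to be evaluated first
(12 + 17/2 - 3/4) * 2.5      # 49.375

# Three equivalent ways to take a base-10 logarithm
all.equal(log(2, 10), log(2, base = 10))   # TRUE
all.equal(log(2, base = 10), log10(2))     # TRUE

# exp() undoes the natural log, and 10^x undoes log10()
all.equal(exp(log(2)), 2)      # TRUE
all.equal(10^log10(2), 2)      # TRUE

# Trigonometric functions work in radians, so degrees are converted first
all.equal(asin(sin(45 * pi / 180)) * 180 / pi, 45)   # TRUE
```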
2. Storing results in named objects. Syntax: object.name = mathematical.expression
ans1 = 23 + 14/2 - 18 + (7*pi/2)
Type ans1 to display its value.
ans2 = 13 + 11 + (17 - 4/7)
ans1 + ans2 / 2
ans3 = ans2 + 9 - 2 + pi
ans4 <- 3 + 5
ans5 <- ans1 * ans2
ans3 + pi / ans4 -> ans6   (Here we can't use the = sign, because the value is on the left and the name is on the right.)
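The three assignment forms used in step 2 can be compared side by side. This is a small sketch with made-up object names (total1, total2, total3), not part of the original exercise.

```r
# Leftward assignment: = and <- both store the value on the right
# in the name on the left
total1 = 3 + 5
total2 <- 3 + 5

# Rightward assignment: the value is on the left, so only -> works here
3 + 5 -> total3

# All three objects hold the same value
total1 == total2   # TRUE
total2 == total3   # TRUE

# By contrast, writing  3 + 5 = total4  is an error,
# because = always assigns to the name on its left
```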
3. Using the combine command c() to create data
Entering numeric data
data1 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)
Type data1 to display it.
data2 = c(data1, 4, 5, 7, 3, 4)
data1 = c(6, 7, 6, 4, 8, data1)
Entering Text data items

day1 = c('Mon', 'Tue', 'Wed', 'Thu')


day1 = c(day1, 'Fri')

mix = c(data1, day1)

Fig.2: My exercises (R commands and their results)
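The last command in step 3 combines numbers with text, and it is worth seeing what type the result has. A minimal sketch, using short made-up vectors (nums, days) in place of data1 and day1:

```r
# Small made-up vectors standing in for data1 and day1
nums <- c(3, 5, 7)
days <- c('Mon', 'Tue')

# Combining numbers with text converts every element to text
mix <- c(nums, days)

mix          # "3" "5" "7" "Mon" "Tue"
class(nums)  # "numeric"
class(mix)   # "character"
```

A vector holds only one type of data, so R converts the numbers to character strings rather than reporting an error.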
4. Using scan() to enter numerical data
data3 = scan()
Type some numerical values, separated by spaces:
1: 6 7 8 7 6 3 8 9 10 7
Press the Enter key and type some more numbers on a fresh line:
11: 6 9
Press the Enter key once again to create a new line:
13:
Press the Enter key once more, leaving the line empty, to finish the data entry:
Read 12 items
Now type data3 (the name of the object) to display its contents.
5. Enter Text as data
day2 = scan(what = 'character')
1: Mon Tue Wed
4: Thu
5:

Read 4 items
Type day2 on the R prompt to display the contents

6. Comma-separated input data.


data4 = scan(sep = ',')
1: 23,17,12.5,11,17,12,14.5,9
9: 11,9,12.5,14.5,17,8,21
16:
Read 15 items
Type data4 to display the contents. (Why is the data displayed with decimal points?)
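One way to explore the question about decimal points is the sketch below. It uses the text argument of scan() as a non-interactive stand-in for typing at the prompt; the object name values and the shortened input are made up for the example.

```r
# Non-interactive stand-in for typing the values at the scan() prompt;
# the text argument supplies the input directly
values <- scan(text = "23,17,12.5,11", sep = ",", quiet = TRUE)

# A numeric vector is printed with a common number of decimal places,
# so whole numbers gain ".0" when any element has a fractional part
values            # [1] 23.0 17.0 12.5 11.0
format(values)    # "23.0" "17.0" "12.5" "11.0"

# With whole numbers only, no decimal point is shown
c(23, 17, 11)     # [1] 23 17 11
```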
7. Another example of comma-separated input data
data5 = scan(sep = ',', what = 'char')
1: "Jan","Feb","Mar","Apr","May","Jun"
7: "Jul","Aug","Sep","Oct","Nov","Dec"
13:
Read 12 items
Type data5 to display the contents
8. Reading a file of data from a disk
data6 = scan(file = 'test data.txt')
Read 15 items
Type data6 to display its contents.
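If you do not have a data file at hand, the following sketch creates one first. The temporary location, the object name data6_demo, and the ten values are assumptions made for this illustration, so the item count differs from the 15 shown above.

```r
# Write a small sample file to a temporary location
# (the values are made up for this illustration)
sample_file <- file.path(tempdir(), "test data.txt")
writeLines(c("6 7 8 7 6", "3 8 9 10 7"), sample_file)

# scan() reads whitespace-separated numbers across all lines of the file
data6_demo <- scan(file = sample_file)

data6_demo           # [1] 6 7 8 7 6 3 8 9 10 7
length(data6_demo)   # 10

# Clean up the temporary file
unlink(sample_file)
```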
9. R looks for data files in the working directory. We can find the working directory by using the
getwd() command. If the file is somewhere else, we must type its name and full
location.
10. It may be easier to point permanently at a directory so that the files can be loaded simply by
typing their names. We can alter the working directory using the setwd() command, for example
setwd('Desktop'). To step up one level we can use the command setwd('..').
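The sketch below shows getwd() and setwd() together. It switches to R's temporary directory only because that directory exists on every system; the name old_dir is made up for the example.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Switch to R's per-session temporary directory (used here only as a safe example)
setwd(tempdir())
getwd()

# Step up one level, to the parent of the temporary directory
setwd('..')
getwd()

# Return to where we started
setwd(old_dir)
identical(getwd(), old_dir)   # TRUE
```

Note that the directory name is a character string, so it must be quoted.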
11. In Windows and Macintosh OS there is an alternative method that enables us to select a file. This
opens a browser-type window where we can navigate to and select the file we want to read:

data7 = scan(file.choose())   # choose the file 'test data' in My Documents


Read 15 items
Type data7 to display its contents.

12. In the preceding example, the target file was a plain text file with numerical data separated by
spaces. If the file contains text, or the items are separated by other characters, we use the what = and
sep = instructions as appropriate.
data8 = scan(file.choose(), what = 'char', sep = ',')   # choose the CSV file in My Documents
Read 12 items
Type data8 to display its contents.

In this example, the target file contained the month data that you met previously; the file was a
CSV file where the names of the months (text labels) were separated with commas.
13. Viewing all objects:

ls()
ls(pattern = 'b')        # all items whose names contain b
ls(pattern = 'be')       # all items whose names contain be
ls(pattern = '^b')       # all items whose names start with b
ls(pattern = '^be')      # all items whose names start with be
ls(pattern = '^b|^e')    # all items whose names start with either b or e
ls(pattern = 'm$')       # all items whose names end with m
ls(pattern = 'a.e')      # all items whose names contain a and e with exactly one character in between
ls(pattern = 'a..e')     # all items whose names contain a and e with exactly two characters in between
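The patterns are regular expressions, in which a dot stands for exactly one character of any kind. The sketch below tries them on a scratch environment; the object names are made up so that the demo does not depend on what is in your workspace.

```r
# A scratch environment with made-up object names
e <- new.env()
for (nm in c("bee", "beetle", "bird", "elm", "lake", "maize", "worm")) {
  assign(nm, 1, envir = e)
}

ls(e, pattern = '^b')      # "bee" "beetle" "bird"
ls(e, pattern = '^be')     # "bee" "beetle"
ls(e, pattern = '^b|^e')   # "bee" "beetle" "bird" "elm"
ls(e, pattern = 'm$')      # "elm" "worm"
ls(e, pattern = 'a.e')     # "lake"   (a, one character, e)
ls(e, pattern = 'a..e')    # "maize"  (a, two characters, e)
```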

14. Examining data objects

str(grass)       # data frame
str(grass.l)     # list
str(bird)        # matrix
class(grass.l)   # list
class(grass)     # data frame
class(bird)      # matrix
class(month)     # character
class(mow)       # integer
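The objects named in step 14 are not created anywhere in this session, so the sketch below builds small stand-ins with the same names. Their contents are invented purely so that str() and class() have something to inspect.

```r
# Made-up objects using the names from step 14; the values are invented
mow     <- c(12L, 15L, 17L)                       # integer vector
month   <- c('Jan', 'Feb', 'Mar')                 # character vector
grass   <- data.frame(mow = mow, month = month)   # data frame
grass.l <- list(mow = mow, month = month)         # list
bird    <- matrix(1:6, nrow = 2)                  # matrix

class(mow)       # "integer"
class(month)     # "character"
class(grass)     # "data.frame"
class(grass.l)   # "list"
class(bird)      # "matrix" in R 3.x; "matrix" "array" from R 4.0.0 onward
is.matrix(bird)  # TRUE in either version

# str() gives a compact summary of the structure of any object
str(grass)
str(grass.l)
str(bird)
```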
Save and Read a Binary Data File to and from Disk
1. We create a simple data object to hold a simple numerical vector.
savedata = c(9, 2, 4, 6, 5, 7, 9, 2, 1, 1, 7)
2. We can see the newly created object by using ls() command and also by its name.
savedata
3. Now we save our new data object to a file.
save(savedata, file = 'savedata test.Rdata')
4. Next, we remove the object from R using rm() command.
rm(savedata)
5. We can check that the object is gone by typing its name or using the ls() command. Once we are
convinced that the data is really gone, we can use the load() command to read the file from disk.
load(file = 'savedata test.Rdata')
6. Alternatively, we can use file.choose() as the filename and select the file from the browser (in
Windows operating systems).
load(file = file.choose())

7. Type in the name of the object to get its data.


savedata
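The whole save, remove, and load cycle can be run in one go. The sketch below follows the steps above but, as an assumption for safety, writes to R's temporary directory (via the made-up name rdata_file) so that nothing is left behind on disk.

```r
# Create the object and save it to a temporary file
savedata <- c(9, 2, 4, 6, 5, 7, 9, 2, 1, 1, 7)
rdata_file <- file.path(tempdir(), "savedata test.Rdata")
save(savedata, file = rdata_file)

# Remove the object and confirm it is gone
rm(savedata)
exists("savedata")   # FALSE, provided no other object has that name

# load() restores the object under its original name
load(file = rdata_file)
exists("savedata")   # TRUE
savedata

# Clean up the temporary file
unlink(rdata_file)
```

Note that load() does not need an assignment: the file stores the object's name along with its data.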
