Why Analytics?
Business Intelligence and Analytics (BI&A) and the related field of Big Data Analytics have become
increasingly important for business communities over the past two decades. There has been an explosion
in the amount of data available in the world. Experts believe that by the end of 2015, the total amount of
data generated will reach a gigantic volume of 7.9 zettabytes, where 1 zettabyte is equivalent to 1 trillion
gigabytes. Analytics is used today, and will be used even more tomorrow, for various applications
involving large volumes of data, such as life science research, analysis of consumer behavior, analysis of
social media usage, weather forecasting, and accurate healthcare services. Thus, it has become very
important to understand, infer, and learn quickly, often in real time, from the huge volume of data in
today's information-driven business world. Naturally, a large number of analytics tools and
software packages have evolved over the last few years to handle these challenges. However, the programming
language R is rapidly becoming the de facto standard among professionals in the analytics industry
because of its extremely rich set of libraries and because it is an open-source tool.
Why R?
R is a flexible and powerful open-source implementation of the language S (for statistics) developed by
John Chambers and others at Bell Labs. R has eclipsed S and the commercially available S-Plus program
for many reasons. R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at
the University of Auckland. In 1993 the first announcement of R was made to the public. Ross's and
Robert's experience developing R is documented in a 1996 paper in the Journal of Computational and
Graphical Statistics:
Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics, 5(3):299–314, 1996.
In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the GNU
General Public License to make R free software. This was critical because it allowed for the source code
for the entire R system to be accessible to anyone who wanted to tinker with it (more on free software
later).
In 1996, public mailing lists were created (the R-help and R-devel lists), and in 1997 the R Core Group
was formed, containing some people associated with S and S-PLUS. Currently, the core group controls
the source code for R and is solely able to check in changes to the main R source tree. Finally, in 2000 R
version 1.0.0 was released to the public.
R is free, and has a variety (nearly 4,000 at last count) of
contributed packages, most of which are also free. R works on Macs, PCs, and Linux systems. In this
book, you will see screens of R 3.2.3 running in a Windows 7 environment, but you will be able to use
everything you learn with other systems, too. Although R is initially harder to learn and use than a
spreadsheet or a dedicated statistics package, you will find R is a very effective statistics tool in its own
right, and is well worth the effort to learn.
Here are five compelling reasons to learn and use R.
R is open source and completely free. It is the de facto standard and preferred program of many
professional statisticians and researchers in a variety of fields. R community members regularly
contribute packages to increase R's functionality.
R is as good as (often better than) commercially available statistical packages like SPSS, SAS,
and Minitab.
R has extensive statistical and graphing capabilities. R provides hundreds of built-in statistical
functions as well as its own built-in programming language.
R is used in teaching and performing computational statistics. It is the language of choice for
many academics who teach computational statistics.
Getting help from the R user community is easy. There are readily available online tutorials, data
sets, and discussion forums about R.
R combines aspects of functional and object-oriented programming. One of the hallmarks of R is implicit
looping, which yields compact, simple code and frequently leads to faster execution. R is more than a
computing language. It is a software system. It is a command-line interpreted statistical computing
environment, with its own built-in scripting language. Most users mean both the language and the
computing environment when they say they are using R. You can use R in interactive mode, which we
will consider in this introductory text, and in batch mode, which can automate production jobs. We will
not discuss the batch mode in this book. Because we are using an interpreted language rather than a
compiled one, finding and fixing your mistakes is typically much easier in R than in many other
languages.
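To make the idea of implicit looping concrete, here is a minimal sketch that compares an explicit for loop with two implicit forms. The vector x and the variable names are hypothetical, chosen only for illustration.

```r
# Hypothetical data: ten numeric measurements.
x <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7)

# Explicit loop: square each element one at a time.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Implicit looping: the arithmetic operator is applied to every element at once.
squares_vec <- x^2

# sapply() is another implicit form: it applies a function to each element.
squares_apply <- sapply(x, function(v) v^2)

# Check that the three approaches agree.
identical(squares_loop, squares_vec)
identical(squares_vec, squares_apply)
```

The vectorized form x^2 is usually the most compact of the three and the one you will meet most often in R code.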
Limitations of R
No programming language or statistical analysis system is perfect. R certainly has a number of
drawbacks. For starters, R is essentially based on almost 50-year-old technology, going back to the
original S system developed at Bell Labs. There was originally little built-in support for dynamic or 3-D
graphics (but things have improved greatly since the old days).
Another commonly cited limitation of R is that objects must generally be stored in physical memory. This
is in part due to the scoping rules of the language, but R generally is more of a memory hog than other
statistical packages. However, there have been a number of advancements to deal with this, both in the R
core and also in a number of packages developed by contributors. Also, computing power and capacity
have continued to grow over time, and the amount of physical memory that can be installed on even a
consumer-level laptop is substantial. While we will likely never have enough physical memory on a
computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a
bit easier over time.
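As a rough illustration of objects living in memory, the following sketch uses the base R functions object.size() and gc() to inspect and release the memory held by a vector. The vector big is a hypothetical example and is not taken from the text.

```r
# One million double-precision numbers (hypothetical example vector).
big <- numeric(1e6)

# object.size() estimates the memory used by an object, in bytes.
size_bytes <- object.size(big)

# Print the estimate in megabytes.
print(size_bytes, units = "Mb")

# Remove the object and ask R to run garbage collection.
rm(big)
invisible(gc())
```

Each double takes 8 bytes, so a vector of this length occupies roughly 8 million bytes plus a small header.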
At a higher level one limitation of R is that its functionality is based on consumer demand and
(voluntary) user contributions. If no one feels like implementing your favorite method, then it's your job
to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect
the interests of the R user community. As the community has ballooned in size over the past 10 years, the
capabilities have similarly increased. When I first started using R, there was very little in the way of
functionality for the physical sciences (physics, astronomy, etc.). However, now some of those
communities have adopted R and we are seeing more code being written for those kinds of applications.
If you do not already have R running on your system, download the precompiled binary files for your
operating system from the Comprehensive R Archive Network (CRAN) web site, or preferably, from a
mirror site close to you. Here is the CRAN web site: http://cran.r-project.org/
Download the binary files and follow the installation instructions, accepting all defaults. Launch R by
clicking on the R icon. For other systems, open a terminal window and type R on the command line.
When you launch R, you will get a screen that looks something like the following. You will see the label
R Console, and this window will be in the RGui (graphical user interface).
Official Manuals
An Introduction to R
R Data Import/Export
Writing R Extensions: Discusses how to write and organize R packages
R Installation and Administration: This is mostly for building R from the source code
R Internals: This manual describes the low level structure of R and is primarily for developers and R
core members
R Language Definition: This documents the R language and, again, is primarily for developers
Fig. 2: My exercises (columns: R command, Result)
4. Use scan() to enter numerical data
data3 = scan()
Type some numerical values, separated by spaces, at the 1: prompt:
1: 6 7 8 7 6 3 8 9 10 7
Press the Enter key and type some more numbers on the fresh line:
11: 6 9
Press the Enter key once again to create a new line, then press Enter once more on the empty line to
finish the data entry:
13:
Read 12 items
Now type data3 (the name of the object) to display its contents.
5. Enter Text as data
day2 = scan(what = 'character')
1: Mon Tue Wed
4: Thu
5:
Read 4 items
Type day2 at the R prompt to display its contents.
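The interactive entries above can also be reproduced without typing at the prompt. This is a minimal sketch that passes the same values through the text argument of scan(); using text = here is my choice for illustration, and the exercises themselves use keyboard entry.

```r
# Numeric data: the same twelve values entered in the numerical exercise.
data3 <- scan(text = "6 7 8 7 6 3 8 9 10 7 6 9")

# Character data: what = "character" tells scan() to read text items.
day2 <- scan(text = "Mon Tue Wed Thu", what = "character")

# Display the contents of both objects.
data3
day2
```

Because the input is written into the script, the result is the same every time the script runs, which is handy when you want to share an example.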
12. In the preceding example, the target file was a plain text file with numerical data separated by
spaces. If the file contains text, or the items are separated by other characters, we use the what = and
sep = instructions as appropriate.
data8 = scan(file.choose(), what = 'char', sep = ',')   # choose the CSV file in My Documents
Read 12 items
Type data8 to display its contents.
In this example, the target file contained the month data that you met previously; the file was a
CSV file where the names of the months (text labels) were separated with commas.
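To try the what = and sep = instructions without a file dialog, here is a small sketch that writes a comma-separated file to a temporary location and reads it back. The file contents (abbreviated month names) and the name csv_path are assumptions made for illustration; in the exercise the file is picked with file.choose().

```r
# Write a small comma-separated file to a temporary location
# (a stand-in for the CSV file chosen with file.choose()).
csv_path <- tempfile(fileext = ".csv")
writeLines("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec", csv_path)

# Read the comma-separated text labels into a character vector.
data8 <- scan(csv_path, what = "char", sep = ",")

# Display the contents, then delete the temporary file.
data8
unlink(csv_path)
```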
13. Viewing all objects:
ls()                      # all objects in the workspace
ls(pattern = 'b')         # all items whose names contain b
ls(pattern = 'be')        # all items whose names contain be
ls(pattern = '^b')        # all items whose names start with b
ls(pattern = '^be')       # all items whose names start with be
ls(pattern = '^b|^e')     # all items whose names start with either b or e
ls(pattern = 'm$')        # all items whose names end with m
ls(pattern = 'a.e')       # all items whose names contain a and e with exactly one character in between
ls(pattern = 'a..e')      # all items whose names contain a and e with exactly two characters in between
In a pattern, a dot stands for any single character, not a literal dot.
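The patterns can be tried safely in a scratch environment, so the results do not depend on what is in your workspace. In this sketch the environment e and the object names are hypothetical, picked so that each pattern matches something.

```r
# Create a fresh environment holding a few hypothetical objects.
e <- new.env()
for (nm in c("bees", "beetle", "bird", "elm", "worm", "lake", "table")) {
  assign(nm, 1, envir = e)
}

ls(envir = e)                       # every object in the environment
ls(envir = e, pattern = "^b")       # names starting with b
ls(envir = e, pattern = "^b|^e")    # names starting with b or e
ls(envir = e, pattern = "m$")       # names ending with m
ls(envir = e, pattern = "a.e")      # a, any one character, then e
ls(envir = e, pattern = "a..e")     # a, any two characters, then e
```

Here 'a.e' matches lake but not table, while 'a..e' matches table but not lake, which shows that each dot consumes exactly one character.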
Alternatively, we can use file.choose() as the filename and select the file from the browser (on
Windows operating systems):
load(file = file.choose())
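For a version of load() that runs without a file dialog, here is a minimal round-trip sketch. It assumes the file being loaded was written with save(); the temporary file path and the object data3 are hypothetical stand-ins for a file you would normally pick with file.choose().

```r
# Save an object to a temporary .RData file (hypothetical file).
data3 <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)
rdata_path <- tempfile(fileext = ".RData")
save(data3, file = rdata_path)

# Remove the object, then restore it from disk.
rm(data3)
loaded_names <- load(file = rdata_path)

# load() returns the names of the objects it restored.
loaded_names
exists("data3")

# Delete the temporary file.
unlink(rdata_path)
```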