
An Introduction to R

Text Analytics using R

Sandip Mukhopadhyay
What is R?

R is a scripting/programming language and environment for statistical computing, data science and graphics.
R is a successor of the proprietary statistical
computing programming language S.
It is an important tool for computational
statistics, visualization and data science.
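WHAT IS R?
• GNU project, created by Ross Ihaka and Robert Gentleman at the University of Auckland; based on the S language developed by John Chambers and colleagues at Bell Labs
• Free software environment for statistical computing and graphics
• Language with functional programming features; R itself is written primarily in C, Fortran and R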
HISTORY AND EVOLUTION OF R
R has developed from the S language

S Version 1

S Version 2

S Version 3

S Version 4
Developed 30 years ago for research
applied to the high-tech industry
HISTORY AND EVOLUTION OF R
The regular development of R

1990’s: R developed
concurrently with S
1993: R made public

Acceleration of R development
• R-help and R-devel mailing lists
• Creation of the R Core Group (now the R Core Team)
HISTORY AND EVOLUTION OF R

Growing number of packages

2001: ~100 packages

Today: Over 10152 packages

2000: R version 1.0.1


Today: R version 3.6.1

Source: R Journal Vol 1/2


Reasons to learn R

• Free, Open source


• Preferred option in academia and research
• Great visualization
• Advanced statistics
• Integration with other programming
languages
• Supportive open source community
• Easy extensibility via packages
• Many find it easier to learn compared to
Python
Limitations of R

• Lack of scalability
• Less widely adopted in industrial applications than its peer, Python
• R is used mainly for data science, while Python has wider usage
RStudio
RStudio is a widely used IDE for writing, testing and executing R code. A typical RStudio screen has several parts:
• Console: where commands are run and the output is shown
• Syntax editor: where we write the code
• Workspace tab: shows the active objects created by the code that has been run
• History tab: shows a history of the commands used
• Files tab: shows the folders and files in the default workspace
• Plots tab: shows graphs
• Packages tab: shows the add-ons and packages required for running specific processes
• Help tab: contains information on the IDE, commands, etc.
[Figure: RStudio window with panes labelled Syntax editor, History, Console and Help / Viewer]


Packages in R

A package in R is the fundamental unit of shareable code. It is a collection of the following:
• Functions
• Data sets
• Compiled code
• Documentation for the package and for the functions
inside

Packages that are not part of core R need to be installed before first use.

An installed package also needs to be loaded in every session in which it is used:
library("ggplot2")
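A few commands to get started

packageDescription("ggplot2")   # show the package's description details
help(package = "ggplot2")       # list the package's help topics
find.package("ggplot2")         # path where the package is installed
install.packages("ggplot2")     # download and install the package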
Some basics about R coding

• R statements or commands can be separated by a semicolon (;) or a new line.
• The assignment operator in R is "<-" (although "=" also works).
• All characters after # on a line are treated as comments.
• There are no multiline or block-level comments.
• The $ (dollar) operator in R is analogous to the “.” (dot) operator in other languages: it accesses a named component of an object.
• Single quotes ' ' and double quotes " " work the same way for writing strings.
• Parentheses ( ) are used for function calls and for grouping expressions, square brackets [ ] for indexing into objects, and curly braces { } for grouping several statements into a block (for example in functions, loops and if statements).
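A short sketch of these syntax rules in action. All variable names and values here are made up for illustration.

```r
# Two statements on one line, separated by a semicolon
x <- 5; y <- 10

# Assignment with "<-" (preferred) and with "="
total <- x + y
total2 = x + y

# Single and double quotes create the same string
same <- identical('text', "text")

# Parentheses call a function; square brackets index into a vector
v <- c(10, 20, 30)
second <- v[2]

# Curly braces group several statements into one block
if (total > 10) {
  msg <- "total is greater than 10"
  print(msg)
}

# The $ operator extracts a named component
person <- list(name = "Sita", age = 25)
person$name
```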
Functions and Help in R

• There are over 1,000 functions at the core of R, and new R functions
are created all the time.
• Each R function comes with its own help page. To access a function’s
help page, type a question mark followed by the function’s name in
the console.
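For illustration, a minimal sketch of looking up help for mean() and inspecting the arguments of paste(). The two functions were picked only as examples.

```r
# Typing ?mean or help("mean") at the console opens the help page for mean().
# Here the result is stored instead, so nothing is displayed until it is printed.
help_page <- help("mean")

# formals() returns a function's arguments as a named list
arg_names <- names(formals(paste))
print(arg_names)
```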
Reference materials / e-books

1. R_for_dummies- Andrie de Vries and


Joris Meys
2. Hands-On Programming with R - Garrett
Grolemund
3. Introduction to R- Programming - W. N.
Venables, D. M. Smith and the R Core
Team
4. R for Beginners _ cran-r projects -
Emmanuel Paradis
5. The Art of R- Programming - Norman
Matloff
Reference materials / other R resources

1. R-blogs : https://www.r-bloggers.com
2. R tutorials :
https://www.programiz.com/r-
programming/
3. R Video book : https://www.r-
bloggers.com/in-depth-introduction-to-
machine-learning-in-15-hours-of-expert-
videos/
4. Stackoverflow
5. R pubs
Reference materials / other analytics resources

1. www.analyticsmag.com
2. www.kdnuggets.com
3. www.analyticsbridge.com
4. www.datapine.com
5. www.datasciencecentral.com
Operators in R
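As a brief sketch, the four operator families covered in this session (arithmetic, relational, logical and assignment) look like this. The values of a and b are arbitrary.

```r
a <- 7               # assignment
b <- 2

a + b                # arithmetic, addition: 9
a %/% b              # arithmetic, integer division: 3
a %% b               # arithmetic, remainder: 1
a ^ b                # arithmetic, exponent: 49

a > b                # relational: TRUE
a == b               # relational: FALSE

(a > 5) & (b > 5)    # logical AND: FALSE
(a > 5) | (b > 5)    # logical OR: TRUE
!(a > 5)             # logical NOT: FALSE
```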
Commonly used control statements in R

• if...else statement
• switch statement
• for loop
• while loop
• repeat loop
• next statement
• break statement
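The statements listed above can be sketched as follows. The variable names and threshold values are invented for the example.

```r
score <- 72

# if...else
if (score >= 50) {
  result <- "pass"
} else {
  result <- "fail"
}

# switch: pick a value by name, with a default as the last argument
grade <- "B"
label <- switch(grade, A = "excellent", B = "good", "unknown")

# for loop with next (skip an iteration) and break (leave the loop)
evens <- c()
for (i in 1:10) {
  if (i %% 2 == 1) next   # skip odd numbers
  if (i > 6) break        # stop once i exceeds 6
  evens <- c(evens, i)
}

# while loop: runs as long as the condition holds
n <- 0
while (n < 3) {
  n <- n + 1
}

# repeat loop: runs until break is reached
k <- 0
repeat {
  k <- k + 1
  if (k == 4) break
}
```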
Summary : what we have learnt

• Four types of operators in R are arithmetic, relational, logical, and assignment.
• Two types of conditional statements in R are if…else and nested if…else.
• Three types of loops in R are for loop, while loop, and repeat loop.
• The commonly used functions in R
This concludes the session :
Basic R Programming

Next session : Data Structures


QUIZ TIME
Types of Data Structure in R

• Scalars – single values; in R a scalar is simply a vector of length one
• Vectors – an ordered, one-dimensional sequence of values of the same type
• Matrices - These are two-dimensional data structures
• Arrays - Similar to matrices; these can have more than two
dimensions.
• Data frames - These are the most commonly used data structures in
R. A data frame is similar to a general matrix, but it can contain
different modes of data, such as a number and character.
• Lists - These are the most complex data structures. A list may contain
a combination of vectors, matrices, data frames, and even other lists.
Types of Data Structure in R : Scalar

Scalars – single values; in R a scalar is simply a vector of length one

Example:
f <- 3 # numeric
f
g <- "US" # text
g
h <- TRUE # logical
h
Types of Data Structure in R : Vector

Vectors – an ordered, one-dimensional sequence of values of the same type.
Example:
a <- c(1, 2, 5, 3, 6, -2, 4)
a
b <- c("one", "two", "three")
b
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
c
Vectors
• Vectors are stored like arrays in C
• Vector indices begin at 1
• All vector elements must have the same mode, such as integer, numeric (floating-point number), character (string), logical (Boolean) or complex.

Create a vector of numbers

The c function (c is short for combine) creates a new vector consisting of three
values: 4, 7, and 8.
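A minimal sketch of the vector described above; the variable name x is an assumption.

```r
x <- c(4, 7, 8)   # combine three values into one vector
x                 # prints the three values
length(x)         # 3
class(x)          # "numeric"
```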
Vectors
A vector cannot hold values of different data types. Consider the example below. We are trying to place integer, string and Boolean values together in a vector.

Note: All the values are converted to the same data type, i.e. “character”.
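The conversion can be seen in a small sketch like this one; the values are arbitrary.

```r
mixed <- c(1L, "two", TRUE)   # integer, string and logical together
mixed                         # all three values are now strings: "1", "two", "TRUE"
class(mixed)                  # "character"
```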
Vectors

Accessing the value(s) in the vector

Create a variable named “VariableSeq” and assign to it a vector consisting of string values.

• To access values in a vector, specify the indices at which the values are present in the vector. Indices start at 1.
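A sketch of index-based access. The name VariableSeq comes from the slide; the string values are hypothetical.

```r
VariableSeq <- c("alpha", "beta", "gamma", "delta")   # hypothetical values

VariableSeq[1]         # first element: "alpha"
VariableSeq[c(2, 4)]   # second and fourth elements: "beta" "delta"
VariableSeq[2:3]       # a range of elements: "beta" "gamma"
VariableSeq[-1]        # everything except the first element
```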
Types of Data Structure in R : Matrices

Matrices - These are two-dimensional data structures

Example:
vector <- c(1,2,3,4)
f <- matrix(vector, nrow=2, ncol=2)
f
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Matrices
To access the 2nd column of the matrix, simply provide the column number and
omit the row number.

To access the 2nd and 3rd columns of the matrix, simply provide the column
numbers and omit the row number.
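For illustration, a sketch of column access on a 3 x 4 matrix. The matrix m and its contents are assumptions made for the example.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)   # filled column by column

m[, 2]      # 2nd column: 4 5 6
m[, 2:3]    # 2nd and 3rd columns, returned as a 3 x 2 matrix
m[1, ]      # 1st row: 1 4 7 10
m[2, 3]     # single element in row 2, column 3: 8
```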
Types of Data Structure in R : Arrays

Arrays - Similar to matrices; these can have more than two dimensions.

a <- matrix(c(1,1,1,1) , 2, 2)
b <- matrix(c(2,2,2,2) , 2, 2)
x <- array(c(a,b), c(2,2,2))
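A short sketch of how the array built above can be inspected. It repeats the three lines from the slide so that it runs on its own.

```r
a <- matrix(c(1, 1, 1, 1), 2, 2)
b <- matrix(c(2, 2, 2, 2), 2, 2)
x <- array(c(a, b), c(2, 2, 2))

dim(x)        # dimensions: 2 2 2
x[, , 1]      # first 2 x 2 slice (all 1s)
x[1, 2, 2]    # row 1, column 2 of the second slice: 2
```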
Types of Data Structure in R : Data frames

Data frames - These are the most commonly used data structures in R.
A data frame is similar to a general matrix, but it can contain different
modes of data, such as a number and character.

name <- c("Ram", "Laxman", "Sita", "Urmila")
gender <- c("M", "M", "F", "F")
age <- c(27, 26, 25, 24)
df <- data.frame(name, gender, age)
df
Data Frames

Think of a data frame as something akin to a database table or an Excel spreadsheet.

Create a data frame

• First create three vectors, “EmpNo”, “EmpName” and “ProjName”
• Then create a data frame, “Employee”
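These two steps might look like the sketch below. The vector and data frame names come from the slide; the values are hypothetical.

```r
# Step 1: three vectors of equal length (hypothetical values)
EmpNo    <- c(1001, 1002, 1003)
EmpName  <- c("Asha", "Ravi", "Meena")
ProjName <- c("Alpha", "Beta", "Gamma")

# Step 2: combine them into a data frame, one column per vector
Employee <- data.frame(EmpNo, EmpName, ProjName)
Employee
```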


Types of Data Structure in R : Lists

Lists - These are the most complex data structures. A list may contain a
combination of vectors, matrices, data frames, and even other lists.

Example:
vec <- c(1,2,3,4)
mat <- matrix(vec,2,2)
x <- list (vec, mat)
Data Frame Access

There are two ways to access the content of data frames:

• By providing the index number in square brackets.
Example:

• By providing the column name as a string in double brackets.
Example:
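A sketch of both access styles, using a hypothetical Employee data frame. stringsAsFactors = FALSE is set so that text columns stay character vectors on older R versions too.

```r
Employee <- data.frame(
  EmpNo    = c(1001, 1002, 1003),
  EmpName  = c("Asha", "Ravi", "Meena"),
  ProjName = c("Alpha", "Beta", "Gamma"),
  stringsAsFactors = FALSE
)

Employee[2]             # by index: a data frame holding the 2nd column
Employee[[2]]           # by index: the 2nd column as a vector
Employee[["EmpName"]]   # by name in double brackets: the same vector
Employee$EmpName        # the $ shorthand gives the same result
Employee[1, ]           # first row, all columns
```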
Few R functions for understanding data in data frames

• dim()
The dim() function is used to obtain the dimensions of a data frame.

• nrow()
The nrow() function returns the number of rows in a data frame.

• ncol()
The ncol() function returns the number of columns in a data frame.

• str()
The str() function compactly displays the internal structure of R objects.

• summary()
Use the summary() function to return result summaries for each column of the dataset.
Few R functions for understanding data in data frames
• head()
The head() function is used to obtain the first n observations, where n is 6 by default.

• tail()
The tail() function is used to obtain the last n observations, where n is 6 by default.

• edit()
The edit() function will invoke the text editor on the R object.
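A sketch of the inspection functions on a small made-up data frame. edit() is left out because it opens an interactive editor.

```r
df <- data.frame(
  id    = 1:8,
  score = c(55, 62, 71, 48, 90, 66, 73, 81)   # hypothetical values
)

dim(df)        # 8 2 (rows, columns)
nrow(df)       # 8
ncol(df)       # 2
str(df)        # compact structure: column names, types and first values
summary(df)    # minimum, quartiles, mean and maximum for each column
head(df)       # first 6 rows
tail(df, 3)    # last 3 rows
```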
Text Data in R

• Text in R is represented by character vectors. A character vector is a vector consisting of strings of characters.
• In the world of computer programming, text is often referred to as a string.
Example:
> x <- "Hello world!"
> is.character(x)
[1] TRUE
• We can find the number of elements with length() and the number of characters with nchar()
• We can extract a subset of a text vector
• Recycling character vectors - When you perform operations on vectors of different lengths, R automatically repeats the elements of the shorter vector to match the longer one. This is called recycling.
• Sorting of text by sort()
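The operations listed above can be sketched as follows; the example strings are arbitrary.

```r
x <- c("Hello", "world!")
is.character(x)      # TRUE
length(x)            # 2 elements in the vector
nchar(x)             # 5 and 6 characters in the two elements

fruits <- c("pear", "apple", "fig", "mango")
fruits[2:3]          # subset: "apple" "fig"
sort(fruits)         # "apple" "fig" "mango" "pear"

# Recycling: the single string "item" is repeated to match 1:3
labels <- paste("item", 1:3)
labels               # "item 1" "item 2" "item 3"
```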
Text Data in R

• grep() – a very important function in ‘text matching’
• The name of the grep() function originated in the Unix world. It’s an acronym for Global Regular Expression Print. Regular expressions are a very powerful way of expressing patterns of matching text, usually in a very formal language.
• Whole books have been written about regular expressions.
• The function name grep() appears in many programming languages that deal with text and reporting. Perl, for example, is famous for its extensive grep functionality.
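A small sketch of pattern matching with grep() and its relative grepl(). The city names are made up for the example.

```r
cities <- c("New Delhi", "Mumbai", "New York", "Kolkata")

grep("New", cities)                       # positions of the matches: 1 3
grep("New", cities, value = TRUE)         # the matching elements themselves
grepl("New", cities)                      # TRUE FALSE TRUE FALSE
grep("a$", cities, value = TRUE)          # regular expression "ends in a": "Kolkata"
grep("new", cities, ignore.case = TRUE)   # case-insensitive match: 1 3
```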
Import and export of data in R

• Importing data from and exporting data to a .csv file
• Two very important functions
• read.csv() – reads data from a specified .csv file into a data frame
• write.csv() – writes a data frame to a .csv file (in the working directory unless a full path is given)
• read.csv() is a special case of read.table()
• write.csv() is a special case of write.table()
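A minimal round-trip sketch: write a small data frame to a CSV file and read it back. The file name scores.csv and the use of a temporary directory are assumptions made so the example leaves no files behind in the working directory.

```r
scores <- data.frame(name = c("Ram", "Sita"), age = c(27, 25))

# Hypothetical file location inside R's temporary directory
csv_path <- file.path(tempdir(), "scores.csv")

write.csv(scores, csv_path, row.names = FALSE)   # export
loaded <- read.csv(csv_path)                     # import
loaded
```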
Working with directory

getwd()
getwd() command returns the absolute filepath of the current working
directory.

setwd()
setwd() command resets the current working directory to another
location as per users’ preference.

dir()
This function returns a character vector of the names of files or
directories in the named directory.

version
The version variable shows details of the version of R that is running.
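A sketch of moving between directories. The temporary directory and the file notes.txt are used only for the demonstration.

```r
old_dir <- getwd()          # remember the current working directory

setwd(tempdir())            # switch to a temporary directory
file.create("notes.txt")    # hypothetical file, created only for the demo
files <- dir()              # names of files in the working directory

setwd(old_dir)              # switch back to where we started

R.version.string            # one-line description of the R version in use
```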


Manipulating Text in Data
Function | Arguments | Description
substr(a, start, stop) | a is a character vector; start and stop are numeric values | Returns the part of the string that starts at position start and ends at position stop.
strsplit(a, split, …) | a is a character vector; split is also a character vector that contains a regular expression for splitting | Splits the given text string into substrings.
paste(…, sep = " ", …) | The dots “…” stand for R objects; sep is a character string for separating the objects | Concatenates string vectors after converting the objects into strings.
grep(pattern, a) | pattern contains the matching pattern; a is a character vector | Searches for a text pattern in the given character vector and returns the positions of the matching elements (or the elements themselves with value = TRUE).
toupper(a) | a is a character vector | Converts a string into uppercase.
tolower(a) | a is a character vector | Converts a string into lowercase.
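For illustration, each function from the table applied to one made-up string:

```r
a <- "Text Analytics using R"

part  <- substr(a, 1, 4)                        # "Text"
words <- strsplit(a, " ")[[1]]                  # "Text" "Analytics" "using" "R"
glued <- paste("Text", "Analytics", sep = "-")  # "Text-Analytics"
hits  <- grep("R$", c(a, "Python"))             # 1 (only the first element ends in R)
upper <- toupper(a)                             # "TEXT ANALYTICS USING R"
lower <- tolower(a)                             # "text analytics using r"
```

strsplit() returns a list with one entry per input string, which is why [[1]] is used to pull out the vector of words.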

Copyright © 2018
List

Create a list, “emp”, with three elements: “EmpName”, “EmpUnit” and “EmpSal”.

To get the elements of the list “emp”, use the command below.

Retrieve the names of the elements in the list “emp”.
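A sketch of these three steps. The element names come from the slide; the name and salary values are hypothetical.

```r
# Create the list (name and salary are hypothetical values)
emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)

emp                # print all elements of the list
names(emp)         # "EmpName" "EmpUnit" "EmpSal"

emp$EmpName        # one element by name
emp[["EmpSal"]]    # the same idea with double brackets
```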

List
Add an element with the name “EmpDesg” and value “Software Engineer” to the
list, “emp”.

Output:

Delete an element with the name “EmpUnit” and value “IT” from the list,
“emp”.
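Adding and deleting list elements might look like this. The list is recreated here with hypothetical name and salary values so the sketch runs on its own.

```r
emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)

# Add a new element by assigning to a new name
emp$EmpDesg <- "Software Engineer"
names(emp)           # "EmpName" "EmpUnit" "EmpSal" "EmpDesg"

# Delete an element by assigning NULL to it
emp$EmpUnit <- NULL
names(emp)           # "EmpName" "EmpSal" "EmpDesg"
```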

Methods for Reading Data
Reading CSV Files
A CSV file uses the .csv extension and stores tabular data as plain text. The following function reads data from a CSV file:
read.csv("filename")
where filename is the name of the CSV file that needs to be imported.

Reading Spreadsheets
read.xlsx("filename", …)
where the filename argument defines the path of the file to be read; the dots “…” stand for the other optional arguments. read.xlsx() is not part of base R; it is provided by add-on packages such as xlsx and openxlsx, which must be installed and loaded first.
