
In the previous video we talked about representing data in text when you are trying to communicate.

Now we are going to talk a bit about representing data in R, which is where we will be doing most of our data analysis. First we are going to talk about the important data types in R: the classes of data that you can have, such as characters, numeric values, integers, and logicals, as well as objects like vectors, matrices, data frames, lists, factors, and missing values. Then we'll talk a little bit about operations like subsetting and logical subsetting. For more information, see the data types video created for the Computing for Data Analysis class, which is also included as background material for this course.

So we're going to start off with characters. In R, you can assign a variable, like firstName, to have a value, like "jeff", where jeff is in quotes. If you look at the class of this variable by applying the class function, class(firstName), you get out something that says character. So this is a character variable. You can also type firstName to see the variable's assigned value, which is "jeff". Character variables are good for storing text, as opposed to storing numeric values.

Numeric values can be stored in numeric variables. So for example, I'm storing my height in centimeters in the variable heightCM, and I'm assigning it a value of 188.2. I can look at the class of this variable to see that it's numeric, or I can type the variable and hit return to see its value, 188.2. Much of the data that we'll be using in this class will be numeric data, and will be assigned to numeric variables.

You can also assign integer variables. In some cases, the data that you're representing are discrete: they shouldn't be able to take on any continuous value, only integer values. To do this, we take the integer that we care about, in this case, say, 1, followed by a capital L.
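As a rough sketch of the three classes described so far, the assignments might look like the following. The variable names and values are the ones mentioned in the lecture; the use of `<-` for assignment is an assumption about style, since `=` would work here as well.

```r
# Character: text values go in quotes
firstName <- "jeff"
class(firstName)    # [1] "character"
firstName           # [1] "jeff"

# Numeric: can take any continuous value
heightCM <- 188.2
class(heightCM)     # [1] "numeric"
heightCM            # [1] 188.2

# Integer: the capital L suffix asks R for an integer
numberSons <- 1L
class(numberSons)   # [1] "integer"
numberSons          # [1] 1
```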

So we are assigning the number of sons that I have, the variable numberSons, to be equal to 1L. You can look at the class of this variable and see that it's an integer, or type numberSons and hit return, and you get 1 out. Note that if you had assigned just 1 without the L, you could have done the same analysis, and typing numberSons would still give you 1, but the class would be numeric rather than integer. If you care about a variable being an integer, you need to assign it with the L suffix.

Another kind of value that will become useful, especially when writing code that requires for loops, if statements, or other control structures, is the logical value. So here we can assign a variable called teachingCoursera, because I'm teaching the Coursera course, and we can set it to be equal to TRUE. If we look at the class of this variable, it's a logical variable, and if we type teachingCoursera and hit return, we get TRUE. We can use these variables to perform comparisons that we can later use in logical structures. I'll be talking about these types of variables as we go on in the class, what their properties are and when they're used, but it's a good idea to review what the different types are.

Once we've assigned variables of a particular class, we can create sets of those values and assign them to vectors. Vectors are sets of values with the same class. We can create them with the c function, where c stands for concatenate. So here I'm setting heights to be the values 188.2, 181.3, and 193.5. If I type heights, I then get all 3 of those values. Values passed to c are separated by commas. You can also create a vector that consists of character values. So for example, here I'm creating a vector called firstNames that consists of 4 character values: "jeff", "roger", "andrew", and "brian". Again, I've separated them by commas, and if I type firstNames and hit return, I see those values back out.

Sometimes we might want to put values together that have different classes. A vector of values of possibly different classes is called a list. So here I create 2 vectors, vector1 and vector2. vector1 has 3 values that are numeric. vector2 has 4 values that are characters. I can then use the list command to put those two vectors together into one object called a list. If I type myList, I then see that both the heights and the first names have been stored in the variable myList.

Another type of vector that might be of interest during the class is the matrix. Matrices are just vectors with multiple dimensions. So instead of storing one set of values, you can store values in multiple rows. So here, I'm creating a variable myMatrix, and the way that I'm doing that is I'm assigning the values 1, 2, 3, and 4 to that matrix and telling R to store them by rows. In other words, it's going to run from left to right, storing values across a row until it hits the end of the row, then starting a new row and filling from left to right again. Here I'm telling it to have 2 rows in the matrix, so the values start off as 1, 2, it hits the end of the first row, and then it starts the second row with 3, 4.

The objects that we'll use most commonly in this class are data frames. These are multiple vectors, of possibly different classes, all of the same length. So for example, if I create those same two vectors with three numeric values and four character values, and try to create a data frame with them, R reports an error saying that the arguments have different numbers of rows. The reason is that these two vectors have different lengths, three and four. Because the data frame was never created, typing its name gives an error that the object cannot be found.
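The logical, vector, list, matrix, and failed data frame examples above can be sketched roughly as follows. The list element names `heights` and `firstNames` and the `tryCatch` wrapper are assumptions added for illustration; the wrapper only captures the error so that the script keeps running.

```r
# Logical value
teachingCoursera <- TRUE
class(teachingCoursera)   # [1] "logical"

# Vectors: values of one class, combined with c()
heights <- c(188.2, 181.3, 193.5)
firstNames <- c("jeff", "roger", "andrew", "brian")

# List: holds vectors of different classes and lengths
vector1 <- c(188.2, 181.3, 193.5)
vector2 <- c("jeff", "roger", "andrew", "brian")
myList <- list(heights = vector1, firstNames = vector2)
myList

# Matrix: fill 2 rows from left to right, giving rows (1, 2) and (3, 4)
myMatrix <- matrix(c(1, 2, 3, 4), byrow = TRUE, nrow = 2)
myMatrix

# Data frame with vectors of length 3 and 4: data.frame() raises an error.
# tryCatch returns the error message instead of stopping the script.
result <- tryCatch(
  data.frame(heights = vector1, firstNames = vector2),
  error = function(e) conditionMessage(e)
)
result
```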
However, if we add a fourth measurement to vector1, so that we have 4 numeric values and 4 text values, we can create a data frame using the data.frame command, and assign it the values of heights and first names. If you look at myDataFrame, each column is labeled with the variable name, heights or firstNames, and each row contains the corresponding value of heights and the corresponding value of firstNames. The way a data frame is structured, the values in a row should correspond to one observation. So for example, 188.2 is assumed to be the height of Jeff, and 181.3 would be the height of Roger.

Another type of variable that we'll be using quite often in this class is the factor. Factors are qualitative variables that can be included in models. It's often hard to include character vectors directly in statistical models in R, so a different category of variable is created for that purpose. So for example, suppose we create a vector smoker that consists of the characters "yes", "no", "yes", "no", four values, one for each observation in the study, and we want to use this vector to analyze something about the differences between smokers and non-smokers. We would generally create a factor with the function as.factor. We now have a factor version of the smoker variable, and when we print this variable out, you see the same four values, along with the levels that correspond to those values, no and yes. If there were 3 distinct values, including a maybe, you would see the levels maybe, no, and yes, because R sorts the levels alphabetically by default. We'll talk about how factors are used when we get on to statistical modeling.

Another important kind of value that we will be considering is the missing value. In R, missing values are coded as NA. So, for example, in this vector1 that I have created here, there are 3 numeric values and 1 missing value, which I've just typed as NA. If I type vector1 I get the values 188.2, 181.3, 193.4, and NA, which indicates that there's a missing value here.
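A minimal sketch of the data frame, factor, and missing value examples above might look like this. The fourth height (192.3) is a made-up placeholder, since the lecture does not state it, and the names `smokerFactor` and `vectorWithNA` are hypothetical, chosen only for this illustration.

```r
# Data frame: both vectors now have 4 values
vector1 <- c(188.2, 181.3, 193.5, 192.3)   # 192.3 is a placeholder value
vector2 <- c("jeff", "roger", "andrew", "brian")
myDataFrame <- data.frame(heights = vector1, firstNames = vector2)
myDataFrame          # one row per person, one column per variable

# Factor: a qualitative variable built from a character vector
smoker <- c("yes", "no", "yes", "no")
smokerFactor <- as.factor(smoker)
smokerFactor         # prints the four values and the levels
levels(smokerFactor) # [1] "no"  "yes"

# Missing values are written as NA
vectorWithNA <- c(188.2, 181.3, 193.4, NA)
vectorWithNA         # [1] 188.2 181.3 193.4    NA
```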
I can also use the command is.na to determine which of the values are missing in this particular vector. So the first 3 values are not missing, but the last value is missing. Throughout the course, we will code missing values with NA, and learn about how to deal with them.

Next I'm going to briefly go over subsetting. While we're doing our data analyses, we will often want to consider only part of a particular vector or data frame. So here, I've generated two vectors again, one numeric with four values, and one character vector, again with four values. I then put them together into one single data frame. Now if I want to access just the first value of vector1, I can type vector1, open square bracket, 1, close square bracket. That will return just the first value from vector1. Similarly, I can use the c function to give the indexes of particular values. So suppose I wanted the first, second, and fourth values of vector1. I can then subset vector1 using the same square bracket notation, passing it the indexes 1, 2, and 4 to say which values in that vector I would like to access.

Similarly, we can look at specific rows and columns in a data frame. Here, for this data frame, I am looking at the first row and the first 2 columns. That returns the height and first name from the first row. Alternatively, I can access a particular column using the $ operator. This operator is applied by typing myDataFrame$ and then the name of the variable you want to access, in this case firstNames. This will then give all the first names in the data frame.

We can also subset using logical operations. Recall that we talked about logical variables at the beginning of this lecture. They can be either true or false. So here's an example. I have my data frame, and suppose that I want to identify all the rows in that data frame corresponding to cases where the first name is equal to "jeff". In that case, I can use the variable name firstNames and check, using the == operator, whether it's equal to "jeff". This comparison returns a vector that is TRUE only where "jeff" appears. Here that is equivalent to asking for the first row of the data frame, since only the first row corresponds to a first name of "jeff", and indeed that's what's returned. You can also do things like identify the parts of the data frame corresponding to heights in a particular range. For example, this will return only the rows in the data frame corresponding to heights less than 190. Alternatively, I can put conditions after the comma, and those will apply to columns rather than rows.

A quick note on variable naming conventions in R. Variable names should be short but descriptive. There are some common styles. For example, camel caps, where names alternate between lower case and upper case, with the first letter of each new word in upper case. Another example is putting underscores between the words, or using dots between the words. Each of these conventions is used by different people at different times. You should pick whichever one is most comfortable for you. You can see style guides at the different websites that I've linked to here, and you'll see that each style guide suggests naming variables, functions, and so forth in slightly different ways.

I know that this has been a quick tour of these particular concepts. If you have any trouble following along in the lectures that follow, please consider viewing the Computing for Data Analysis videos created by Roger, so that you will be able to understand all of the data analyses we are performing.
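The missing value check and the subsetting operations described in this lecture can be sketched roughly as follows. As an assumption for illustration, the fourth height (192.3) is a placeholder value that the lecture does not state, so the rows returned by the height comparison depend on that choice.

```r
# Which values are missing?
is.na(c(188.2, 181.3, 193.4, NA))   # [1] FALSE FALSE FALSE  TRUE

# Rebuild the two vectors and the data frame
vector1 <- c(188.2, 181.3, 193.5, 192.3)   # 192.3 is a placeholder value
vector2 <- c("jeff", "roger", "andrew", "brian")
myDataFrame <- data.frame(heights = vector1, firstNames = vector2)

# Subsetting a vector by position
vector1[1]             # first value only
vector1[c(1, 2, 4)]    # first, second, and fourth values

# Subsetting a data frame: [rows, columns]
myDataFrame[1, 1:2]    # first row, first two columns
myDataFrame$firstNames # one column, selected by name

# Logical subsetting: keep the rows where the condition is TRUE
myDataFrame[myDataFrame$firstNames == "jeff", ]
myDataFrame[myDataFrame$heights < 190, ]
```

Leaving the position after the comma empty, as in the last two lines, keeps every column for the selected rows.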
