
In this video I'll be talking about the sources of data sets.

What I mean by the sources of data sets is the way in which the data were obtained, measured, or collected. These components of a data set play an important role in the types of questions that you can answer in a data analysis. So data are defined by the way that they are collected, and there are several main types. There's a census, which can typically be used for descriptive problems. There's an observational study, which is usually used for inferential problems. There's a convenience sample, which may be used for all types of analysis, but it's a particular kind of sample that may be biased. There is a randomized trial, which is often used for causal analysis. Those are the main types, but there are other types as well. There is a prediction study, for performing prediction analyses, and there are also studies over time. These are cross-sectional, which are usually used for inferential purposes, or longitudinal, which are used for inferential or predictive purposes. Finally, there are retrospective studies, which are used for inferential purposes as well. I'll be going over each of these types of data collection.

So to understand this, we first need to consider the set of objects that we're interested in. In this case, it's this set of people. There are 8 people in this particular population, and we want to know something about that population. So what you can do is take 1 individual from that population, in this case individual 4, and measure her. You can collect information about her name, about whether she has cancer or not, whether she's a smoker or not, whether she lives in Baltimore, and whether she exercises. Then, there are different ways that you can collect the data about the individuals in the population and measure these variables. In a census you measure each of the individuals.

I've indicated the individuals that have been measured here with blue. Since we measure the variables on all individuals, there is no inferential problem. In other words, we don't need to use a small subset of these individuals to say something about the larger population.

In an observational study we might take a random sample of individuals. Here, I have set the seed, to make the results reproducible, and I have sampled from the individuals numbered 1 to 8. I have taken a random sample of size 4, and I have used sampling without replacement; in other words, each individual can only be sampled once. This results in sampling individuals 2, 5, 6 and 8. I might then measure those individuals, 2, 5, 6 and 8, and using the data on those individuals, I might try to make an inference about the population of all individuals in this study, all 8 individuals.

In the idealized observational study, individuals are sampled at random from the population. But often it's not so easy to do that. So, for example, suppose that the person measuring the individuals was standing over here on the left side of the picture, and it was much easier for them to measure individuals that were closer to them. In this case, it might be 5 times as likely for them to measure an individual from 1, 2, 3 and 4 than from 5, 6, 7 and 8. Here, I've generated a set of probabilities that are 5 times as large for individuals on the left side of the figure as for individuals on the right. I take a random sample of size 4 from the population, with sampling probabilities weighted towards people on the left-hand side. This represents the case where the person on the left is having a hard time sampling individuals on the right side of the picture. And indeed, we observe that more individuals on the left are sampled than individuals on the right.
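The two sampling schemes just described can be sketched in code. The lecture does this in R; the version below is a rough analogue using only Python's standard `random` module. The function names, the seed value, and the weights `[5, 5, 5, 5, 1, 1, 1, 1]` are assumptions made for illustration, and because the seed and random number generator differ from the lecture's, the sampled individuals will generally not be 2, 5, 6 and 8.

```python
import random


def simple_random_sample(population, size, seed):
    """Draw `size` individuals without replacement, each equally likely."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    return sorted(rng.sample(population, size))


def convenience_sample(population, weights, size, seed):
    """Draw `size` individuals without replacement, favoring higher weights.

    One remaining individual is picked at a time with probability
    proportional to its weight, then removed so nobody is sampled twice.
    """
    rng = random.Random(seed)
    remaining = list(population)
    remaining_weights = list(weights)
    chosen = []
    for _ in range(size):
        index = rng.choices(range(len(remaining)), weights=remaining_weights, k=1)[0]
        chosen.append(remaining.pop(index))
        remaining_weights.pop(index)
    return sorted(chosen)


population = list(range(1, 9))        # individuals 1 through 8
weights = [5, 5, 5, 5, 1, 1, 1, 1]    # left side (1-4) is 5 times as likely

print(simple_random_sample(population, 4, seed=1))
print(convenience_sample(population, weights, 4, seed=1))
```

Running the weighted version over many different seeds should show individuals 1 to 4 turning up far more often than 5 to 8, which is the bias the convenience sample introduces.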
In this case, if there is some difference between the individuals on the left and the right that we have not accounted for, it may be difficult to make an inference about the entire population with a convenience sample. So care should be used when performing statistical inference with convenience samples. Similar care should be used when performing predictive analysis.

The previous studies are good for inferential or descriptive analysis. If the goal is a causal analysis, in other words, if we want to find a variable that, when changed, leads directly to a change in another variable, we may need to perform a randomized trial. In a randomized trial, patients are assigned, say, to two different treatments, and then followed up to see what happens to them after those treatments have been given. If there's a difference in what happens to them between the 2 treatments, we can then determine that the treatments caused that difference. The reason is that, because of the randomization, the treatment variable is independent of all the other variables in the population. Here we use R to perform this randomized trial. First, we sample 2 individuals, without replacement, from the population. And then, among the remaining individuals, we take another sample of size 2. The first 2 individuals are assigned to Treatment 1, and the second 2 individuals are assigned to Treatment 2. So individuals 1 and 8 get Treatment 1, and individuals 3 and 4 get Treatment 2. We then follow them up to see if there's any difference in their outcome, and perform an analysis to see if the treatment caused that difference.

The next type of study is a prediction study. In this kind of study we need 2 data sets: first, a training set where we will build a predictive model, and second, a test set where we will evaluate the predictive model. In the training set we'll take a random sample of size 4 from the population, without replacement. Since we've set the seed we get the same individuals, 2, 5, 6 and 8.
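Both the randomized assignment and the training/test split follow the same pattern: draw one sample without replacement, then draw a second sample from whoever is left. The sketch below shows that pattern in Python rather than the R used in the lecture. The helper name `sample_then_sample_rest`, the seed, and the choice to put all 4 remaining individuals in the test set are assumptions for illustration; the individuals selected will not necessarily match the ones named in the lecture.

```python
import random


def sample_then_sample_rest(population, first_size, second_size, seed):
    """Draw two disjoint random samples, both without replacement.

    The second sample is taken only from individuals not in the first.
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    first = rng.sample(population, first_size)
    rest = [person for person in population if person not in first]
    second = rng.sample(rest, second_size)
    return sorted(first), sorted(second)


population = list(range(1, 9))  # individuals 1 through 8

# Randomized trial: 2 individuals per treatment arm.
treatment_1, treatment_2 = sample_then_sample_rest(population, 2, 2, seed=1)
print("Treatment 1:", treatment_1)
print("Treatment 2:", treatment_2)

# Prediction study: 4 for training, the remaining 4 for testing (assumed size).
training_set, test_set = sample_then_sample_rest(population, 4, 4, seed=1)
print("Training set:", training_set)
print("Test set:", test_set)
```

Because the assignment depends only on the random number generator, it cannot be related to smoking, exercise, or any other variable measured on the individuals, which is what lets a difference in outcomes be attributed to the treatment.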
We measure all of their variables and then we try to build a predictive function, say to predict whether or not they have cancer based on their smoking status, their location, and whether they exercise or not. Then we take a second sample from the remaining individuals and measure all of their variables. We apply our predictive function to the variables on exercise status, smoking status, and location, and see whether we can predict whether the individuals have cancer. This will allow us to evaluate our predictive function.

Next, I will consider a couple of types of studies that measure data over time. In this case, imagine we have the same 8 individuals and we could measure them at Day 1, Day 2 or Day 3. A cross-sectional study picks a particular time point and takes a random sample of individuals at that time point. Another type of study that you could perform is a longitudinal study. In this type of study, you choose individuals to follow from the very beginning. So, we take a random sample of individuals at Day 1, and we get 2, 5, 6, and 8. And then we follow them over time. Here the goal might be inference, or might be prediction as well.

Finally, the last example of a study we will consider is a retrospective study. In this case, again, you are following individuals over time. But instead of sampling them at the beginning of the time series, we go to the very end, Day 3. We take a random sample of individuals, 2, 5, 6 and 8. And here, we measure their outcome, whether they have cancer or not. The nice thing here is, if it's a long period of time before cancer develops, we can sample people with cancer, and people who don't have cancer, at the very end of the study. Then, having measured their outcome, we look back and measure how many of them were exposed. And we use this information to try to identify a relationship between outcome and exposure. Here again, the goal is inferential, rather than predictive or causal.
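One way to see how the three time-based designs differ is to list which (individual, day) pairs each one measures. The sketch below does that in Python. The function names and seeds are assumptions for illustration, and the retrospective version is simplified to a plain random sample at Day 3 rather than sampling people with and without cancer separately.

```python
import random

DAYS = [1, 2, 3]
POPULATION = list(range(1, 9))  # individuals 1 through 8


def cross_sectional(day, size, seed):
    """Sample individuals at one chosen time point and measure them only then."""
    rng = random.Random(seed)
    sampled = sorted(rng.sample(POPULATION, size))
    return [(person, day) for person in sampled]


def longitudinal(size, seed):
    """Sample individuals at the first day, then measure them on every day."""
    rng = random.Random(seed)
    cohort = sorted(rng.sample(POPULATION, size))
    return [(person, day) for person in cohort for day in DAYS]


def retrospective(size, seed):
    """Sample at the last day, record the outcome, then look back in time.

    Returns the outcome measurements (last day) and the earlier days
    that would be examined to find out who was exposed.
    """
    rng = random.Random(seed)
    sampled = sorted(rng.sample(POPULATION, size))
    outcome_measurements = [(person, DAYS[-1]) for person in sampled]
    exposure_lookups = [(person, day) for person in sampled for day in DAYS[:-1]]
    return outcome_measurements, exposure_lookups


print(cross_sectional(day=2, size=4, seed=1))
print(longitudinal(size=4, seed=1))
print(retrospective(size=4, seed=1))
```

The longitudinal design commits to a cohort up front and pays for repeated measurement, while the retrospective design starts from known outcomes and reconstructs the earlier exposure.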
