
College Guesstimate

Rohan Battulwar, Prasheel Fuley, Annaiya Mahajan, Shivam Joshi, Roshan Chaturpale
Students, Information Technology, India

Email: { rbatttulwar99@gmail.com, prasheel.fv123@gmail.com, ananiyamahajan69@gmail.com, shivamjoshi7387@gmail.com, roshan.chaturpale@gmail.com }

Abstract
Our aim is to make it easier for students to take better decisions when selecting colleges. Prediction analysis on colleges makes it easier for students to make accurate decisions about their preferred colleges. Such analysis requires deriving future possibilities from past record data, which can then drive predictions and recommendations for students. Predictive modelling with data mining methods helps deliver probable accuracy, and this requires analytical methods for predicting future recommendations. The most apparent and reasonable algorithms for this decision are the Decision Tree and the Random Forest algorithm, both of which fall under supervised learning. The problem can be solved through these modelling techniques and the tools available for big data, in order to make the most probable decisions.

Keywords - Predictive modelling, Decision tree, Random forest, Big data, Prediction Analysis

I. INTRODUCTION

Considering the current scenario of the admission system of our state, where a common governing body oversees the process, it is fairly easy to keep track of and obtain the admission records of the colleges, as every student goes through the same channel. We have a large data set comprising score details with the marginal cut-offs of colleges across the state; with the help of a data classification algorithm, this big data can be used to make correct decisions.

We use what is called a multiclass classification technique to classify the outcomes based upon past records, i.e. values associated with a set of fixed variables. The variables are the marks of the students and the other deciding factors that come to mind when a student reaches the point of deciding where to get admitted. The deciding factors here are the category, rank, CAP round (MHCET process), preferred branch and the locality of the college that the student wishes to get into. The locations are divided across areas.

The techniques used for classification vary, as we need to train a model where both input and desired output data are provided. Input and output data are labelled for classification to provide a learning basis for future data processing. Training data for supervised learning includes a set of examples with paired inputs and desired outputs.

The most important feature we have is the ability to import data directly into the languages we are using. The huge data set first had to be converted into a format suitable for those languages: a tabular format, defined and categorized as per the software requirements, where tuples and outcomes are separated accordingly.

II. TECHNOLOGIES USED

III. RSTUDIO (FOR STATISTICAL COMPUTING)

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R is capable of making predictions through prediction algorithms: it supports various libraries that include these algorithms, and many of the needed functions are built in.
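As an illustrative sketch of the conversion described above, raw admission records can be separated into feature tuples and outcomes. The field names and sample records below are hypothetical, chosen only to mirror the deciding factors named in the Introduction:

```python
# Sketch: converting raw admission records into (feature tuple, outcome) rows.
# Field names and sample data are hypothetical illustrations.
raw_records = [
    {"rank": 1200, "category": "OPEN", "cap_round": 1,
     "branch": "IT", "location": "Nagpur", "college": "College A"},
    {"rank": 5400, "category": "OBC", "cap_round": 2,
     "branch": "CSE", "location": "Pune", "college": "College B"},
]

def to_tabular(records):
    """Separate each record into an input tuple and its outcome label."""
    rows = []
    for r in records:
        features = (r["rank"], r["category"], r["cap_round"],
                    r["branch"], r["location"])
        outcome = r["college"]
        rows.append((features, outcome))
    return rows

table = to_tabular(raw_records)
print(table[0])  # ((1200, 'OPEN', 1, 'IT', 'Nagpur'), 'College A')
```

Once the data is in this shape, each tuple can serve as the labelled input example that the supervised learning step expects.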

Fig. II (i). R Environment While Predicting Outcome of Animal Shelters

Fig. II (i) shows the R environment where a specific data project, predicting animal shelter outcomes using supervised learning and a Random Forest model, is used.

Fig. I (i). The Methodology

1. TABLEAU

i. Tableau helps to see and understand the data

With the mission "We help people see and understand their data", Tableau products are transforming the way people use data to solve problems. Tableau makes analyzing data fast and easy, beautiful and useful. No wonder it has gained a growing interest among the business users,


but also groups traditionally not using BI tools are taking it into use. You might have noticed that, for example, more and more journalists are using Tableau to publish data stories to the web. One of the biggest Finnish media houses, Helsingin Sanomat, is using Tableau to visualize news-related data.

Tableau is revolutionizing the way data is being used. Accessing and further analyzing the data no longer requires the participation of the IT department. The data is accessible to all levels of organizations and individuals. Democratizing data allows people to think and act quickly, bringing transparency and agility to fact-based decision making.

ii. Tableau suits different kinds of needs and organizations

If you are an individual user, Tableau Desktop is the tool for you. It is an application that resides on your computer and is aimed at individual use. It is used for creating data visualizations and publishing data sources as well as workbooks to Tableau Server.

For enterprises there is a solution as well. Tableau Server is then the way to go – it is aimed at collaboration and security of data visualizations. The data can be taken from anywhere, and shared within the organization via desktop or mobile browsers. There are apps available for different kinds of mobile devices, e.g. iPhone or Android based mobile phones. Tableau Server is an on-premise solution.

There is also the possibility of an online server solution with Tableau Online. Tableau Online cuts down the need for enhanced IT infrastructure or IT support. However, it does not suit enterprise-sized organizations.

iii. Tableau is a killer for creating visual dashboards – easy and fast

Tableau is primarily used for
• Creating dashboards and reports
• Data discovery and visualization
• Self-service BI
• Simple statistical analytics like trends and forecasting

Tableau products are continuously developed keeping in mind the ease of use. What makes Tableau so unique compared to other self-service BI tools is the quality of its data visualizations and self-service analytics. Tableau has set the gold standard for self-service BI tools, which enable business users to analyze data without the need for IT intervention. It is easy and fast.

iv. You don't need to do any coding

Technically, there is no need to code in Tableau. Almost every functionality is possible using drag and drop. Tableau provides you with in-built table calculations to add complex analysis with a click of the mouse.

On the other hand, Tableau integrates with programming languages like R or Python, and it can be of great help in the field of data science.

IV. METHODOLOGY

Predictive analytics is a form of advanced analytics that uses both new and historical data to forecast activity, behavior and trends. It involves applying statistical analysis techniques, analytical queries and automated machine learning algorithms to data sets to create predictive models that place a numerical value, or score, on the likelihood of a particular event happening.

Predictive analytics software applications use variables that can be measured and analyzed to predict the likely behavior of individuals, machinery or other entities. For example, an insurance company is likely to take into account potential driving safety variables, such as age, gender, location, type of vehicle and driving record, when pricing and issuing auto insurance policies.

Multiple variables are combined into a predictive model capable of assessing future probabilities with an acceptable level of reliability. The software relies heavily on advanced algorithms and methodologies, such as logistic regression models, time series analysis and decision trees.

Predictive analytics has grown in prominence alongside the emergence of big data systems. As enterprises have amassed larger and broader pools of data in Hadoop clusters and other big data platforms, they have created increased data mining opportunities to gain predictive insights. Heightened development and commercialization of machine learning tools by IT vendors has also helped expand predictive analytics capabilities.

1. PHP is a server-side scripting language used to develop static and dynamic webpages. PHP scripts can be interpreted on a server that has PHP installed.
2. Random forests can be used for robust classification, regression and feature selection analysis.
3. Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it's processed.
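The Methodology section's central idea, combining measured variables into a single likelihood score, can be sketched as a logistic model. The variable names and weights below are made-up illustrative values echoing the auto-insurance example, not fitted parameters:

```python
import math

# Sketch of a logistic-regression-style risk score, as in the auto-insurance
# example above. Weights and bias are made-up illustrative values.
WEIGHTS = {"age_under_25": 1.2, "prior_accidents": 0.9, "urban_location": 0.4}
BIAS = -2.0

def risk_score(driver):
    """Combine the driver's measured variables into a probability in (0, 1)."""
    z = BIAS + sum(WEIGHTS[k] * driver.get(k, 0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) link

young_urban = {"age_under_25": 1, "prior_accidents": 0, "urban_location": 1}
older_rural = {"age_under_25": 0, "prior_accidents": 0, "urban_location": 0}
print(risk_score(young_urban) > risk_score(older_rural))  # True
```

In a real application the weights would be learned from historical data; the sketch only shows how a fitted model turns variables into a score.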

Fig. II (ii) – Tableau Environment (Representation of rank cutoff across various categories)

V. APPROACH

1. Decision Tree Methods

The fundamental learning approach is to recursively divide the training data into buckets of homogeneous members through the most discriminative dividing criteria. The measurement of "homogeneity" is based on the output label: when it is a numeric value, the measurement is the variance of the bucket; when it is a category, the measurement is the entropy or Gini index of the bucket. During learning, various dividing criteria based on the input are tried (in a greedy manner); when the input is a category (Mon, Tue, Wed ...), it is first turned into binary indicators (isMon, isTue, isWed ...) and the true/false values are used as decision boundaries to evaluate the homogeneity; when the input is a numeric or ordinal value, lessThan/greaterThan boundaries at each training data input value are used. The training process stops when there is no significant gain in homogeneity from splitting the tree further. The members of the bucket represented at a leaf node vote for the prediction: the majority wins when the output is a category, and the member average is used when the output is numeric.
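The greedy split selection just described can be sketched for a numeric input: each training value is tried as a lessThan/greaterThan boundary, and the boundary with the largest entropy reduction wins. The ranks and college labels below are toy values for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a bucket of categorical output labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(xs, ys):
    """Greedily try each training input value as a lessThan boundary and
    return the (gain, threshold) with the largest gain in homogeneity."""
    base = entropy(ys)
    best = (0.0, None)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue  # boundary puts everything in one bucket; skip it
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if base - weighted > best[0]:
            best = (base - weighted, t)
    return best

# Toy data: hypothetical ranks and the college bucket they ended up in.
ranks = [100, 200, 300, 4000, 5000, 6000]
colleges = ["A", "A", "A", "B", "B", "B"]
print(best_numeric_split(ranks, colleges))  # (1.0, 4000)
```

A full tree would apply this search recursively to each resulting bucket until no significant gain remains, exactly as the paragraph above describes.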
Fig. II (ii). Avg. Rank and Score as per Dept. and College

The good part of a tree is that it is very flexible in terms of the data types of the input and output variables, which can be categorical, binary or numeric. The level of the decision nodes also indicates the degree of influence of the different input variables. The limitation is that each decision boundary at each split point is a concrete binary decision. Also, the decision criteria consider only one input attribute at a time, never a combination of multiple input variables. Another weakness of a tree is that, once learned, it cannot be updated incrementally: when new training data arrives, you have to throw away the old tree and retrain from scratch on all the data.

2. Artificial Neural Network

A neural network can be considered as multiple layers of perceptrons (each a logistic regression unit with multiple binary inputs and one binary output). By having multiple layers, this is equivalent to:

z = logit(v1·y1 + v2·y2 + ...), where y1 = logit(w11·x1 + w12·x2 + ...)

This multi-layer model enables the neural network to learn non-linear relationships between the input x and the output z. The typical learning technique is "backward error propagation", where the error is propagated from the output layer back to the input layer to adjust the weights.

Notice that a neural network expects binary input, which means we need to transform categorical input into multiple binary variables. A numeric input variable can be transformed into a binary-encoded string such as 101010. Categorical and numeric outputs can be transformed in a similar way.

3. K-Nearest Neighbor

KNN makes predictions using the training dataset directly. Predictions are made for a new instance (x) by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression this might be the mean output variable; in classification this might be the mode (most common) class value.

To determine which of the K instances in the training dataset are most similar to a new input, a distance measure is used. For real-valued input variables, the most popular distance measure is Euclidean distance, calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (xi) across all input attributes j:

Euclidean Distance(x, xi) = sqrt( sum_j( (xj – xij)^2 ) )

4. Multiclass Classification

Multiclass classification means a classification task with more than two classes; e.g., classifying a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear, but not both at the same time. In the case of the college predictor, the classes are the different branches and colleges taken together.

Fig. IV 4. (i). Multiclass classification results

VI. CONCLUSION

A huge amount of data is being generated every single moment, even while you're reading these words, at unbelievably rapid speeds across the globe. According to an estimate, the global annual rate of data production in the year 2015 was 5.6 Zettabytes, almost double the rate just three years earlier, in 2012.

When it comes to technology management, planning, and decision making, extracting information from existing data sets, that is, predictive analysis, can be an essential business tool. Predictive models are used to examine existing data and trends to better understand customers and products while also identifying potential future opportunities and risks. These business intelligence models create forecasts by integrating data mining, machine learning, statistical modeling, and other data technology.

Benefits –
• Better predictions, better results for students.
• Helpful for students when choosing an engineering college after 12th.
• Makes it easier for colleges to know where they stand in attracting students within a cluster of scores.
• Tableau makes it easier for students to understand the current scenario of admissions by providing graphical representations of college cutoffs.

Limitations –
• Attracts only a certain section of students.
• Possibility of the MHCET process being shut down.
• Noisy data makes predictions hard, as the model is trained purely on the raw data.

Application –
• Makes it easier for students to choose better, based upon the scores and cutoffs of previous colleges.
• Colleges can better understand where to start a campaign.
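As a concrete illustration of the K-nearest-neighbor procedure and Euclidean distance from the Approach section (the (rank, cutoff) training pairs below are toy values, not real admission data):

```python
import math
from collections import Counter

def euclidean(x, xi):
    """Square root of the sum of squared differences across all attributes j."""
    return math.sqrt(sum((xj - xij) ** 2 for xj, xij in zip(x, xi)))

def knn_predict(train, x, k=3):
    """Find the K training instances most similar to x and return the mode
    (most common) class among them."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy training set: (rank, cutoff) feature pairs labelled with a college class.
train = [((100, 95), "A"), ((120, 92), "A"), ((110, 94), "A"),
         ((900, 60), "B"), ((950, 58), "B"), ((920, 61), "B")]
print(knn_predict(train, (115, 93)))  # A
```

No model is trained in advance: the prediction is computed directly from the stored training set, which is exactly why KNN handles new data easily but scales poorly with dataset size.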
VII. REFERENCES

The predictive model has various areas of application; we studied those applications and came up with a variety of methods to perform the task at hand. Below are the websites and books we referred to throughout the understanding and creation of the project.

[1] http://www.webopedia.com/TERM/P/predictive_analytics.html
[2] http://blogs.wsj.com/digits/2014/01/17/amazon-wants-to-ship-your-package-before-you-buy-it/
[3] http://www.entrepreneurial-insights.com/predictive-analytics-forecast-future/
[4] https://hbr.org/2014/09/a-predictive-analytics-primer/
[5] Applied Predictive Modeling, Chapter 7 for regression, Chapter 13 for classification.
[6] Data Mining: Practical Machine Learning Tools and Techniques, pages 76 and 128.
[7] iitnit.com
