PS: If you have questions of your own to share, please write them in the comment section.
I know Python. Questions are mostly asked on data cleaning, joins, lambda and map
functions, and handling datetime data. Go through pandas and NumPy concepts and some
visualisation code you used in your projects.
Q.7. y=ax+b is a linear model. Can you tell me if y=ax^2 + bx + c is also a linear
model ?
Ans. y=ax^2 + bx + c is also a linear model, because it is linear in the parameters
a, b and c: x^2 can be treated as a separate feature X.
So, the relationship between y and x is not linear, but the model fitted is linear.
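A minimal sketch of this idea, assuming NumPy is available: x^2 is added as an extra column and the quadratic is fitted with ordinary least squares. The coefficients 2, -3 and 1 are made up for the illustration.

```python
import numpy as np

# Synthetic, noise-free data from y = 2x^2 - 3x + 1 (illustrative values only)
x = np.linspace(-3, 3, 50)
y = 2 * x**2 - 3 * x + 1

# Treat x^2 and x as two separate features and add an intercept column
X = np.column_stack([x**2, x, np.ones_like(x)])

# Ordinary least squares, the same solver used for a straight-line fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = coef
print(a, b, c)
```

The solver only sees a matrix of features, so it does not matter that one of them happens to be the square of another.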
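Q.8. What is SSE and RMSE? Why use RMSE and not SSE?
Ans. SSE is the total of the squared errors, so it keeps growing as more observations
are added. RMSE is the square root of the mean squared error, so it is in the same
units as y and can be compared across datasets of different sizes.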
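A small sketch of the difference between the two metrics, assuming NumPy. The helper names sse and rmse and the sample values are made up for the illustration.

```python
import numpy as np

def sse(y_true, y_pred):
    """Sum of squared errors: a total, so it grows with the number of rows."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sum(diff ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error: an average, in the same units as y."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 6.0, 6.0, 10.0])  # every prediction is off by exactly 1

# Same errors repeated 10 times: a dataset 10x larger with identical quality
y_true_big = np.tile(y_true, 10)
y_pred_big = np.tile(y_pred, 10)

print(sse(y_true, y_pred), sse(y_true_big, y_pred_big))    # 4.0 40.0
print(rmse(y_true, y_pred), rmse(y_true_big, y_pred_big))  # 1.0 1.0
```

The model is equally good on both datasets, which RMSE reflects and SSE does not.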
Q.2. What are the error metrics used for both of them ?
Ans. Regression : SSE
Classification : Confusion matrix
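Q.3. Why does accuracy not help when there is a class imbalance?
Ans. Because accuracy is dominated by the majority class: a model that always predicts
the majority class gets a high accuracy while missing the minority class entirely.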
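A quick illustration, assuming NumPy and a made-up dataset with 95 negatives and 5 positives. The "model" here simply predicts the majority class every time.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 of class 0, 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)

# A useless model that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_true == y_pred))

# Recall on the minority class: true positives / all actual positives
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
recall = tp / (tp + fn)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

Accuracy looks good, yet not a single positive case is found, which is why metrics derived from the confusion matrix (precision, recall, F1) are usually preferred for imbalanced data.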
Q.1. What is Decision Tree ? How to split ? How does decision tree work ?
Q.2. What does each node contain in a Decision Tree ?
Q.3. What is Entropy and Gini Index and how do they help?
Q.4. What is Random Forest ? What is Random in Random Forest ? How to calculate OOB
Error ?
Q.5. How does random forest work ?
Q.6. Explain the entire process from the point you get the data till you reach the
final stage of prediction.
Q.7. How does KNN work? Which distance measure to use in KNN when data is
categorical?
Q.8. You have 10 documents. Each document has been tagged with a topic. Once a new
document comes, how to tag it to one of those topics?
1) What is the logic behind the Python programs that were asked?
2) What is the softmax function? Explain it mathematically.
3) What is the AUC curve? Write a function to plot it. What are the x-axis and the
y-axis?
4) A few questions on decision trees.
5) Random forest.
6) Let's say a die is biased and the probability of an even number is twice that of
an odd number. What is the expected sum of the numbers rolled in 100,000 experiments?
7) How do you control overfitting in different models?
8) How do you know a linear regression model is doing well?
9) What do the coefficients of logistic regression mean?
10) What are odds?
11) What is IDF?
12) What is the elbow curve in k-means clustering?
13) What is R-squared? Explain the mathematical equations.
14) How do rules get extracted from a decision tree?
15) What is homogeneity?
16) Explain the project you have worked on.
17) How do you detect outliers?
18) How does a box plot get plotted (the whiskers and all)? How do you detect
outliers there?
19) Do you know how to use regular expressions? Tell me how you would extract the
email address from an email.
ANS 6-
I have used the fillna method of the pandas DataFrame for missing values. I have also
used the scikit-learn preprocessing library.
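A minimal sketch of the fillna approach, assuming pandas and NumPy. The column names and values are made up, and filling with the median (numeric) and the mode (categorical) is just one common choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Pune", None, "Delhi", "Pune"],
})

filled = df.copy()

# Numeric column: fill with the median of the observed values (35.0 here)
filled["age"] = filled["age"].fillna(filled["age"].median())

# Categorical column: fill with the most frequent value ("Pune" here)
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(filled)
```

Working on a copy keeps the original frame intact, which makes it easy to compare different filling strategies.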
ANS 9-
k-means is a clustering algorithm and comes under unsupervised learning. In k-means
we try to find the clusters in unlabelled data by assigning each point to its nearest
cluster mean, recalculating the means, and repeating this process until the
assignments stop changing.
Whereas KNN is a classification algorithm where we calculate the distance to the k
nearest neighbours of a particular data point, check their labels, and classify that
point as the label that occurs most often among those k neighbours.
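The contrast can be sketched in plain NumPy. This is a simplified illustration, not a production implementation: the function names and toy data are made up, and the k-means initialisation (evenly spaced rows) is a shortcut chosen to keep the example deterministic.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Supervised: majority label among the k nearest labelled points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

def kmeans(X, k, n_iter=10):
    """Unsupervised: alternate between assigning points and updating means."""
    # Simplified init: pick k evenly spaced rows as starting centroids
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each centroid to the mean of its points (keep it if empty)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
y = np.array(["a", "a", "a", "b", "b", "b"])

# KNN needs the labels y; k-means never sees them
predicted = knn_predict(X, y, np.array([1.1, 1.0]), k=3)
labels, centroids = kmeans(X, k=2)
print(predicted)
print(labels)
```

KNN returns one of the known labels, while k-means only returns cluster numbers whose meaning has to be interpreted afterwards.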
18. How do I create linear models if I have very high-dimensional data?
(Regularization solves this, but interviewers mostly want to hear about PCA, removing
correlated variables, etc.)
19. What if learning rate is too high or too small while minimizing an objective
function using gradient descent?
20. What are concave and convex functions? Why do we prefer convex objective
functions?
21. What is the difference between L1 and L2 regularization?
22. When will we use L1 vs L2 and vice versa?
23. How is gradient descent applied to L1 regularization? (The L1 regularization
term is non-differentiable.)
24. What is clustering?
25. What is K means Clustering?
26. How do you use ordinal, categorical variables in clustering?
27. What if we have high-dimensional data? How do we apply clustering? (Distance
measures like Euclidean fail for high-dimensional data.)
28. What are the problems with K means clustering?
29. How do you decide optimal number of clusters?
30. How do you evaluate goodness of clusters? (The basic intuition is to quantify
intra-cluster and inter-cluster distance so that intra-cluster distance is minimum
and inter-cluster distance is maximum. This page helps:
https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation)
31. Suppose I have two variables which form concentric circles. Which
transformation should be applied so that k means can be applied?
32. What are dimensionality reduction techniques?
33. What is PCA?
34. How do we interpret components of PCA? What do they represent?
35. What is Non Negative matrix Factorization?
36. How do you visualize high dimensional data?
37. When to use PCA or Non Negative Matrix factorization?
38. What is difference between likelihood and probability?
39. What is the maximum likelihood estimation algorithm?
40. What are other clustering techniques if K means does not work?
41. What is K median algorithm?
42. What is Decision Tree? ( How split is decided )
43. Why are decision trees prone to overfitting? How do we overcome this problem?
44. What is Bagging and Boosting?
45. Which one is more prone to overfitting?
46. What is Random forest?
47. How is variable importance calculated in Random Forest?
48. How will the testing metrics be impacted if we create 100, 1000, 10000,
100000 and so on up to a billion trees? At every step, how will they change?
49. How does boosting work?
50. What is the difference between Gradient Boosting, XGBoost, LightGBM and CatBoost?
51. Why is XGBoost so fast?
(b)
data = ['cat', 'bat', 'rat', 'cat', 'rat']
Give the count of each unique element of the list
Answer :
import pandas as pd

data = ['cat', 'bat', 'rat', 'cat', 'rat']
df = pd.DataFrame(data, columns=['Category'])
# value_counts() returns a Series of counts indexed by each unique value
vc = df['Category'].value_counts()
print(vc)
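If the interviewer asks for a solution without pandas, the same count can be sketched with the standard library. This is an alternative to the answer above, not part of it.

```python
from collections import Counter

data = ['cat', 'bat', 'rat', 'cat', 'rat']

# Counter maps each unique element to the number of times it appears
counts = Counter(data)
print(counts)

# Individual counts can be looked up like a dictionary
print(counts['cat'])  # 2
```

Counter also returns 0 for elements that never appear, instead of raising a KeyError.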