
Data Encoding:

I asked the researchers to encode their results, as summarized in the table below. Note that SAgree and SDisagree stand for Strongly Agree and Strongly Disagree.

Out[2]:
   SAgree  Agree  Neutral  Disagree  SDisagree
0       5      4        3         2          1
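As a concrete illustration, the encoding could be applied along these lines in pandas. This is only a sketch: the raw label strings and the sample rows are invented for illustration, since the report shows only the encoded output.

import pandas as pd

# Hypothetical raw responses; labels and rows are assumptions, not the real data.
df = pd.DataFrame({
    "Respondent":     [1, 1, 2, 2],
    "Gender":         ["Male", "Male", "Female", "Female"],
    "Family Status":  ["Other", "Other", "Mother", "Mother"],
    "Product":        ["Commercial", "Own", "Commercial", "Own"],
    "Price":          ["Disagree", "SDisagree", "SDisagree", "Disagree"],
    "Packaging":      ["Agree", "Disagree", "SAgree", "SDisagree"],
    "Fragrance":      ["Agree", "Disagree", "SAgree", "Agree"],
    "Convenience":    ["Agree", "Neutral", "SAgree", "Agree"],
    "Effectiveness":  ["Disagree", "SDisagree", "Disagree", "Neutral"],
    "Recommendation": [None, "LeastRecommended", None, "Recommended"],
})

# Map each Likert label to its numeric code from the table above.
likert_map = {"SAgree": 5, "Agree": 4, "Neutral": 3, "Disagree": 2, "SDisagree": 1}
metric_cols = ["Price", "Packaging", "Fragrance", "Convenience", "Effectiveness"]
df[metric_cols] = df[metric_cols].replace(likert_map)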

For the Recommendation column, 0 was assigned to Commercial by default, since respondents were not asked to provide a recommendation for commercial products.

Out[3]:
   HighlyRecommended  Moderate  Recommended  LeastRecommended  NotRecommended  Commercial
0                  5         4            3                 2               1           0

Similarly, Gender was encoded as 1 for Male and 2 for Female. Family Status was encoded from 1 to 5 as Mother, Father, Child, Maid, and Other, respectively.
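Continuing the sketch, the remaining encodings could look like this (the raw label strings are again assumptions; the actual encoded frame is shown in Out[4] below):

gender_map = {"Male": 1, "Female": 2}
family_map = {"Mother": 1, "Father": 2, "Child": 3, "Maid": 4, "Other": 5}
rec_map = {"HighlyRecommended": 5, "Moderate": 4, "Recommended": 3,
           "LeastRecommended": 2, "NotRecommended": 1}

df["Gender"] = df["Gender"].map(gender_map)
df["Family Status"] = df["Family Status"].map(family_map)
# Commercial rows carry no recommendation, so they default to 0.
df["Recommendation"] = df["Recommendation"].map(rec_map).fillna(0).astype(int)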

Out[4]:
   Respondent  Gender  Family Status     Product  Price  Packaging  Fragrance  Convenience  Effectiveness  Recommendation
0           1       1              5  Commercial      2          4          4            4              2               0
1           1       1              5         Own      1          2          2            3              1               2
2           2       1              5  Commercial      1          5          5            5              2               0
3           2       1              5         Own      2          1          4            4              3               3
4           3       2              5  Commercial      2          2          4            3              3               0
5           3       2              5         Own      2          4          5            2              3               2

There seems to be some noise in the dataset: the value '4' appears in the Product column. We can see from the chart below that it should be Commercial, so we'll go ahead and replace it.
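The cleanup could look like the following sketch; whether the noise is the string '4' or the integer 4 depends on the raw data, so both are handled here.

# Replace the stray value in Product and verify the fix.
df["Product"] = df["Product"].replace({"4": "Commercial", 4: "Commercial"})
df["Product"].value_counts()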

Out[5]: [chart: value counts of the Product column]

More than 50% of the respondents have recommended our product. Also, looking at the distribution by family membership status, ~40% of the respondents are parents (either a father or a mother).
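Both figures can be reproduced along these lines; treating a Recommendation code of 3 (Recommended) or higher as "recommended" is an assumption.

# Share of respondents who recommended our product.
own = df[df["Product"] == "Own"]
print((own["Recommendation"] >= 3).mean())

# Distribution by family membership status.
family_labels = {1: "Mother", 2: "Father", 3: "Child", 4: "Maid", 5: "Other"}
print(df["Family Status"].map(family_labels).value_counts(normalize=True))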

Out[7]: [chart: distribution by family membership status]

The distribution by gender is as follows:

The distribution by gender of those who recommended the product is equal (i.e., both genders recommended the product).
There are slightly more female respondents.
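A sketch of the underlying computation (the "recommended" threshold is again an assumption):

# Gender split, overall and among those who recommended the product.
gender = df["Gender"].map({1: "Male", 2: "Female"})
print(gender.value_counts(normalize=True))
print(pd.crosstab(gender, df["Recommendation"] >= 3, normalize="columns"))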

Out[8]: [chart: distribution by gender]

Now, for the distribution of the metrics by product, we can see that both our product and the commercial product follow the same distribution, apart from some minor fluctuations in the data. This is a really interesting case. We will leave it to our model to choose which metric describes the recommendation best.
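One way to compare the two products numerically is via per-product means of each metric; this is a sketch, and Out[9] likely plots the full distributions instead.

print(df.groupby("Product")[metric_cols].mean())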

Out[9]: [chart: distribution of each metric by product]

Now, to understand why the respondents chose their respective responses, we will use a non-parametric approach, namely a decision tree.

As mentioned, a decision tree is a non-parametric approach. It works by partitioning the input space $X$ into disjoint subsets $R_i$:

$$X = \bigcup_{i=0}^{n} R_i \quad \text{such that} \quad R_i \cap R_j = \emptyset \ \text{for } i \neq j$$

A naive example can be seen below:
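For instance, in one dimension, a single split at $x = 5$ on $X = [0, 10)$ yields the disjoint regions

$$R_1 = \{x \in X : x < 5\}, \qquad R_2 = \{x \in X : x \ge 5\},$$

with $R_1 \cup R_2 = X$ and $R_1 \cap R_2 = \emptyset$; a deeper tree simply splits these regions further.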

Note that we're only concerned with feature importance, not with how well the model generalizes, since our sample size is relatively small.
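The fit behind Out[12] could look like the following sketch; the exact feature set and target are assumptions based on the surrounding text. Since Commercial rows carry the placeholder Recommendation of 0, training on the 'Own' rows only is one reasonable choice.

from sklearn.tree import DecisionTreeClassifier

# Fit a tree with default hyperparameters, as the repr below suggests.
own = df[df["Product"] == "Own"]
tree = DecisionTreeClassifier().fit(own[metric_cols], own["Recommendation"])
tree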

Out[12]: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False,
             random_state=None, splitter='best')
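The feature-importance plot in Out[13] can then be produced roughly as follows; the styling will differ from the original.

import matplotlib.pyplot as plt

# Bar plot of the fitted tree's feature importances.
plt.bar(metric_cols, tree.feature_importances_)
plt.ylabel("Feature importance")
plt.show()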

Out[13]: [chart: feature importances of each metric]

Just like the visual examination of the distributions, the feature-importance plot tells us that almost all metrics are roughly equally important. Still, we can single out Effectiveness, Fragrance, and Price as the most important features.

Analysis made by
Benjamin Reyes Cabalona Jr.
Associate Data Scientist at Novare Technologies
benjamin.cabalonajr@novare.com.hk
benjamin.cabalonajr@outlook.com

