
Business Intelligence

(course code: CIS8008)

09 October 2017, Semester 2

Assignment-3

This report was prepared and submitted by

Pavankumar Achana (0061094561)

Student, Master of Information Systems

University of Southern Queensland

Toowoomba, Queensland.

CONTENTS
TASK 1: RESEARCH AND CRITICAL REVIEW OF LITERATURE
TASK 1.1: INFONOMICS APPROACH
TASK 1.2: SECURITY AND PRIVACY OF “XERO LIMITED COMPANY”
TASK 2: RAPID MINER
TASK 2.1: EXPLORATORY DATA ANALYSIS
TASK 2.2: DECISION TREE MODEL
TASK 2.3: LOGISTIC REGRESSION USING WEKA EXTENSION
TASK 2.4: ACCURACY OF FINAL DECISION TREE MODEL AND FINAL LOGISTIC REGRESSION MODEL
TASK 3: TABLEAU DASHBOARD
TASK 3.1
TASK 3.2
TASK 3.3
TASK 3.4
TASK 3.5: RATIONALE FOR GRAPHIC DESIGN AND FUNCTIONALITY
REFERENCES

TASK 1: RESEARCH AND CRITICAL REVIEW OF LITERATURE

This section presents a research and critical review of infonomics, the different approaches used for data asset valuation, and the security and privacy policy statements of Xero Limited (an organization listed on the Australian Securities Exchange, ASX, under the code XRO).

TASK 1.1: INFONOMICS APPROACH

Informatics is the study of information as it is represented, processed and communicated in both natural and engineered systems (Levy, 2015). It has computational, cognitive and social aspects, and its central notion is the transformation of information through computation or communication. Technological advances are driven by an understanding of informational phenomena (computation, communication and cognition). Informatics is an emerging discipline that combines the science of information with the engineering of information, and it is used in several academic disciplines such as computer science, artificial intelligence and the cognitive sciences (Samuel, 2017).

The debate about whether data is a true asset has largely been settled, as organizations now recognize the potential value that information holds. To generate value from information assets effectively, infonomics practices can be applied to corporate data.

Organizations can use two categories of approaches to establish the valuation of data: foundational methods and financial methods. The foundational methods focus on improving the discipline of information management, while the financial methods address the economic benefits of information (Levy, 2015). Six methods for realizing the valuation of data fall under these two categories.

Source: www.gartner.com

These six methods analyze the data and aggregate its quality characteristics in order to determine its value (Laney, 2014). The six models used for the valuation of data assets are:

1. Non-financial (intrinsic) methods: This model focuses on the intrinsic value of the data. The qualities analyzed are the accuracy, accessibility and completeness of a particular data asset. To determine the importance of the data asset, each quality is analyzed, rated and then compared with the others to arrive at a final score.

2. Business value of information: This model analyzes data characteristics with respect to business processes. Along with accuracy and completeness, timeliness is also evaluated, because timely data is more useful to business processes.

3. Performance value of information: This model captures the impact of data on key performance indicators (KPIs) over a period of time. It uses this data to evaluate performance relative to competitors and so determines the value of the data assets.

4. Cost value of information: This model deals with the cost incurred in acquiring the information, or in replacing it if it is lost. Lost revenue is measured by assigning a value to the data and to the cost incurred to acquire it. This is the most common method for valuing intangible assets.

5. Economic value of information: This model measures the contribution of an information asset to the organization's revenue. The revenue generated by the information asset provides a sense of the value of the data.

6. Market value of information: This model measures the revenue generated by selling, renting or bartering corporate data and thereby determines the value of the information asset. Tracking over time how many times the data is sold is an important factor in determining the value of the data asset.

So, organizations can adopt any one of the above-mentioned methods to value their data assets and use that valuation effectively to create more revenue.
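
To make the intrinsic method more concrete, the sketch below shows, in Python, one way a weighted quality score for a data asset could be computed. The quality dimensions, weights and 0-10 rating scale are illustrative assumptions for this example only; they are not Gartner's published scoring scheme.

    # Illustrative sketch only: a simple weighted-quality score for a data asset.
    # The dimensions, weights and 0-10 rating scale are assumptions, not a
    # published infonomics formula.

    def intrinsic_value_score(ratings, weights):
        """Combine per-dimension quality ratings (0-10) into one weighted score."""
        total_weight = sum(weights.values())
        return sum(ratings[dim] * weights[dim] for dim in weights) / total_weight

    # Hypothetical ratings for one data asset (e.g. a customer master file).
    ratings = {"accuracy": 8, "accessibility": 6, "completeness": 7}
    weights = {"accuracy": 0.5, "accessibility": 0.2, "completeness": 0.3}

    score = intrinsic_value_score(ratings, weights)
    print(f"Intrinsic value score: {score:.2f} out of 10")
    # Scoring several data assets this way lets them be ranked by relative value.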

TASK 1.2: SECURITY AND PRIVACY OF “XERO LIMITED COMPANY”

Xero Limited is a provider of online software that develops cloud-based accounting software for SMEs (small and medium-sized enterprises). It is listed on both the New Zealand and Australian stock exchanges and belongs to the software and services sector under the Global Industry Classification Standard.

The security and privacy policy statements on the company's official website provide assurance that its privacy and security policies are up to date by defining the Terms of Use, which outline its services. The statements address the confidentiality and privacy of intellectual property by clearly defining the ownership of data, the backup of data and the access limitations placed on third-party applications with respect to clients' data. They further outline the liabilities and the limitations associated with those liabilities (Xero, 2017).

The security provided by Xero protects data by outlining access controls, adopting user authentication, incorporating data encryption, implementing network protection and using secure data centers. Xero implements the international standard SOC 2 (Service Organization Control), which addresses security, availability and processing integrity along with privacy and confidentiality. The SOC 2 report provided by Xero outlines the trust services delivered by its cloud-based accounting systems (Xero, 2017).

Further, Data Governance Australia (DGA) sets standards and benchmarks for data collection, usage, management and disclosure in order to meet the current need for an approach that is both flexible and self-regulated. It has provided nine core principles for governing security and privacy when handling data. These nine core principles are:
1. No Harm:

The no-harm rule requires organizations to make their best efforts to ensure that no harm is caused to individuals through the collection, use or disclosure of personal information. While handling clients' data, organizations must act with integrity and must not use the data for unethical purposes. There should be no exploitation of the individuals' data collected by the organization.

I think Xero complies with this rule, as its privacy policy clearly limits the use of data to purposes such as verifying an individual's identity, administering services, providing updates, offering marketing or training-related services, giving technical support and communicating with users. Because the permitted uses of individuals' data are clearly limited, there is little scope for exploitation of that data.

2. Honesty and transparency:



Honesty and transparency relate to the activities of collecting, using and disclosing information about individuals. These activities should comply with the privacy notification statement, the privacy policy and reasonable community expectations. The rule also requires clarity about facts and purpose, updates regarding data, a clear and accessible mechanism for communication, and consent before any personal information is disclosed.

Xero employs various security standards covering data encryption, network protection, access control and security monitoring in order to protect data from access by malicious users. It outlines the limited circumstances in which an individual's data may be disclosed: information is disclosed only to comply with legal processes and investigations, and even in those cases Xero will notify the user before disclosing the personal information.

3. Fairness:

The fairness rule requires that information be collected from individuals only for legitimate business purposes. Fairness applies when collecting, storing and disclosing personal information, having regard to the circumstances of collection, the period for which the data will be stored, applicable legal requirements and the harm to individuals if the data were disclosed. The organization is responsible for addressing evolving community expectations in its handling of data.

Xero updates its privacy policy from time to time to address the methods of data collection, the period of data storage, the conditions for disclosure to third parties and other fairness principles, so as to meet the reasonable expectations of the community. Significant changes are communicated to individuals, and the services continue only if the individual accepts the updated policy.

4. Choice for subject individual:

Choice refers to an easily accessible mechanism through which individuals can exercise choice over the collection and use of their personal information. At Xero, an individual can request access to their personal information through the simple mechanism of emailing a request to privacy@xero.com. The request is processed if it is permitted on legal grounds, and the individual is then provided with the necessary data.

5. Accuracy and access:

Accuracy and access require that data be shared accurately and not in a misleading way. Organizations must also provide simple methods for individuals to update their personal information if there are any errors, and they should encourage the development and adoption of industry standards for the effective handling of data.

Xero uses industry standards in securing the data it collects in order to maintain data accuracy. These standards cover access control, user authentication, data encryption, network protection and secure data centers. Individuals can update their data through chat, email or a voice call to Xero customer care, but only after authentication.

6. Stewardship:

Organizations must ensure that trained staff are employed for data security and storage. In addition, a designated officer must be responsible for compliance with the relevant rules and regulations, and there should be an internal process that ensures this compliance.

Xero employs suitable staff to handle compliance issues. Its Chief Operating and Financial Officer, Mr. Sankar Narayan, oversees the data security standards along with the various internal processes that ensure compliance with the DGA code.

7. Security:

Security measures, including encryption consistent with industry standards, must be adopted to avoid security breaches. Multiple data sets should be collected in de-identified form. Organizations should also ensure that third-party service providers comply with processes that secure personal information in line with recognized industry standards.

Xero employs security services that deliver high performance and quality. It maintains disaster recovery and readiness arrangements in case of geographic adversity. Further, its data encryption, network protection, security monitoring and access control are all implemented according to industry standards.

8. Accountability:

Accountability covers the processes involved in collecting personal information and disclosing personal data to third parties. The person responsible for ensuring compliance with the code must be publicly identified, and organizations must have documents outlining their commitment to, and compliance with, the code.

Xero lists its management team on its website and outlines each person's responsibilities. It also publishes its privacy policy, security practices and terms of use to provide accountability. These documents outline the various processes used for collecting, storing and disclosing personal information in different circumstances.

9. Enforcement:

Compliance with the DGA code must be enforced regularly. Xero regularly amends its privacy policy and implements new policies, with updates provided to individual users.

TASK 2: RAPID MINER

The RapidMiner tool is used to predict the credit risk of loan applicants by analyzing the data in the file creditrisk.csv. The analysis is carried out through exploratory data analysis, a decision tree model and a logistic regression model (Weka extension). The decision tree and logistic regression models are then validated using the Performance operator.
TASK 2.1: EXPLORATORY DATA ANALYSIS

Exploratory data analysis is performed on the data in creditrisk.csv, which consists of nine variables that determine the credit_risk attribute for each applicant. All nine variables are subjected to exploratory data analysis to determine their characteristics (minimum, maximum, average and deviation values). The creditrisk.csv file used for this analysis does not contain any missing values. The following figure shows the design of the exploratory data analysis process.

The attribute Debt_Income_Ratio is of real type and credit_risk is of text type, while the remaining attributes are of integer data type. The following figure shows the values for the attributes credit_risk, Applicant_ID, Credit_Score and Late_Payments.

Since credit_risk is set as the label and is polynomial in nature, its least frequent value is “do not lend” and its most frequent value is “low”. The attribute Applicant_ID has a minimum value of 139993 and a maximum value of 980774, with a deviation of 245697.5 and an average of about 548953.3. The attribute Credit_Score has a minimum value of 397 and a maximum value of 826, with an average of about 613.592 and a deviation of about 117.780. The attribute Late_Payments has a minimum value of 0 and a maximum value of 23, with an average of about 5.024 and a deviation of 4.565.

The next four attributes are Months_In_Job, Debt_Income_Ratio, Loan_Amt and Liquid_Assets. Except for Debt_Income_Ratio, which is real, all of them are of integer data type. The following figure shows these attributes and the corresponding charts.

The attribute Months_In_Job has a minimum value of 2 and a maximum value of 102, with an average of about 27.455 and a deviation of about 18.935. The attribute Debt_Income_Ratio, which is a real value, has a minimum value of 0.05 and a maximum value of 12.523, with an average of about 3.794 and a deviation of about 2.514. The attribute Loan_Amt has a minimum value of 39236 and a maximum value of 449485; its average is 189545.6 and its deviation is about 88783.746. The attribute Liquid_Assets has a minimum value of 830 and a maximum value of 24699, with an average of about 7464.182 and a deviation of 5670.381. The last attribute, Num_Credit_Lines, has a minimum value of 1 and a maximum value of 12, with an average of 5.186 and a deviation of 2.525.
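
For reference, the same summary statistics can be reproduced outside RapidMiner; the following is a minimal pandas sketch, assuming creditrisk.csv sits in the working directory and uses the column names listed above.

    import pandas as pd

    # Load the credit risk data set (assumes creditrisk.csv is in the working directory).
    df = pd.read_csv("creditrisk.csv")

    # Minimum, maximum, mean and standard deviation of every numeric attribute,
    # mirroring the figures reported in RapidMiner's Statistics view.
    print(df.describe().loc[["min", "max", "mean", "std"]])

    # Frequency of each credit_risk label ("do not lend" least frequent, "low" most frequent).
    print(df["credit_risk"].value_counts())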

The correlation between all the attributes except Credit_Risk (selected as the label) is provided by the correlation matrix shown below.

The table shows a strong negative correlation between credit_score and late_payments (-0.779), credit_score and debt_income_ratio (-0.792), credit_score and num_credit_lines (-0.680), late_payments and liquid_assets (-0.622), debt_income_ratio and months_in_job (-0.516), debt_income_ratio and liquid_assets (-0.610), and liquid_assets and num_credit_lines (-0.563). Further, there is a strong positive correlation between debt_income_ratio and num_credit_lines (0.589), liquid_assets and months_in_job (0.531), num_credit_lines and late_payments (0.578), late_payments and debt_income_ratio (0.751), credit_score and months_in_job (0.619), and liquid_assets and credit_score (0.763).
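
A comparable correlation matrix can be computed with pandas; this sketch assumes the same file and column names and drops the label and the applicant identifier before correlating.

    import pandas as pd

    df = pd.read_csv("creditrisk.csv")

    # Pairwise Pearson correlations between the numeric attributes, excluding the
    # label (credit_risk) and the identifier (Applicant_ID), analogous to the
    # RapidMiner Correlation Matrix operator.
    numeric = df.drop(columns=["credit_risk", "Applicant_ID"])
    print(numeric.corr().round(3))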

Performing the chi-squared test, the attribute weights calculated are given below.

The top attributes which influence the credit_risk factor are credit_score,
debt_income_ratio, late_payments, num_credit_lines and liquid_assets.
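
As a rough analogue of RapidMiner's chi-squared attribute weighting, the sketch below scores each candidate attribute against the credit_risk label with scikit-learn's chi2 function. The scores will not match RapidMiner's exactly, because the two tools discretise and scale attributes differently; this is only an assumed approximation.

    import pandas as pd
    from sklearn.feature_selection import chi2

    df = pd.read_csv("creditrisk.csv")

    # Candidate predictors (all non-negative, as chi2 requires) and the label.
    features = ["Credit_Score", "Late_Payments", "Months_In_Job",
                "Debt_Income_Ratio", "Loan_Amt", "Liquid_Assets", "Num_Credit_Lines"]
    scores, _ = chi2(df[features], df["credit_risk"])

    # Rank attributes by their chi-squared score; higher scores suggest
    # a stronger association with the credit_risk label.
    for name, score in sorted(zip(features, scores), key=lambda t: -t[1]):
        print(f"{name:20s} {score:12.1f}")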

TASK 2.2: DECISION TREE MODEL

A decision tree model represents a sequence of branching operations, where each branch depends on comparisons against the values of an attribute. The decision tree process used for predicting the credit risk rating of loan applicants is shown below.

The process includes selecting the attributes credit_score, debt_income_ratio, late_payments, num_credit_lines and liquid_assets, and then assigning credit_risk as the label in the Set Role operator. The result of this decision tree model is shown below.

The branches that are color coded red correspond to applicants who have low credit_risk. Each branch ending in a color-coded output can be analyzed by clicking on it, which shows the path leading to an outcome of very low, low, moderate, high or do not lend. Applicants with credit_score > 782.5; applicants with credit_score <= 782.5 and debt_income_ratio <= 0.890; and applicants with credit_score <= 782.5, debt_income_ratio > 0.890 and liquid_assets > 20478.5 have very low risk when the credit_risk value is calculated. These are therefore the safe applicants who can be trusted with a loan.

Applicants with debt_income_ratio > 10.005 are not reliable and should not be lent anything. Applicants with debt_income_ratio > 5.008; with debt_income_ratio <= 5.008 and credit_score <= 518; with debt_income_ratio <= 5.008, credit_score > 518 and late_payments > 8.5; or with debt_income_ratio <= 5.008, credit_score > 518, late_payments <= 8.5 and liquid_assets > 1721 have a high risk factor associated with them, so lending to these applicants would be highly risky. All other applicants fall into the low or moderate risk categories.
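
For comparison, a minimal scikit-learn sketch of the same modelling step is shown below. It assumes the five attributes identified in the exploratory analysis; the depth limit and split criterion are assumed settings, so the resulting branches will not match the RapidMiner Decision Tree operator exactly.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    df = pd.read_csv("creditrisk.csv")

    # The five attributes selected from the chi-squared weighting step.
    features = ["Credit_Score", "Debt_Income_Ratio", "Late_Payments",
                "Num_Credit_Lines", "Liquid_Assets"]
    X, y = df[features], df["credit_risk"]

    # A shallow tree kept readable on purpose; max_depth=4 is an assumed setting.
    tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

    # Text rendering of the branches, analogous to reading the tree diagram.
    print(export_text(tree, feature_names=features))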

TASK 2.3: LOGISTIC REGRESSION USING WEKA EXTENSION

The following figure shows the design process used to build the logistic regression model with the Weka extension. Logistic regression is a predictive analysis technique used to predict the outcome of a target attribute from a given set of independent variables or attributes. The dependent variable here is credit_risk, a polynomial variable with five output values. The model is used to determine the presence of risk factors associated with a specific outcome.

The process uses the Retrieve operator to retrieve the data from the repository and then Select Attributes to select the top five attributes identified in the exploratory data analysis. Next, the Set Role operator is used to mark credit_risk as the label. The data is then passed to the W-Logistic operator (Weka extension).

The output provided by RapidMiner using the W-Logistic operator is given below.

The coefficients are the weights applied to each attribute before they are added together. The odds ratios indicate how strongly a change in an attribute or variable will affect the prediction. Debt_Income_Ratio has a value of 462144, associated with the very low credit_risk outcome, while num_credit_lines has a value of 730 and late_payments a value of 28.4, indicating that changes in their values will have a significant impact on the credit_risk value predicted for each loan applicant.
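
A rough scikit-learn analogue of the W-Logistic step is sketched below; multinomial logistic regression is used because credit_risk has five classes. The standardisation step is an assumption, and the coefficients and odds ratios will differ numerically from the Weka output shown above.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("creditrisk.csv")
    features = ["Credit_Score", "Debt_Income_Ratio", "Late_Payments",
                "Num_Credit_Lines", "Liquid_Assets"]

    # Standardising the attributes is an assumed preprocessing step.
    X = StandardScaler().fit_transform(df[features])
    y = df["credit_risk"]

    # Multinomial logistic regression over the five credit_risk classes.
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Odds ratios (exponentiated coefficients) per class: values far from 1
    # indicate attributes whose change strongly shifts the predicted class.
    for cls, coefs in zip(model.classes_, model.coef_):
        print(cls, dict(zip(features, np.exp(coefs).round(3))))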

TASK 2.4: ACCURACY OF FINAL DECISION TREE MODEL AND FINAL LOGISTIC REGRESSION MODEL

VALIDATION OF FINAL DECISION TREE:

The final decision tree is validated using the Cross Validation operator, which consists of two subprocesses: training and testing. The training subprocess contains the decision tree, and the testing subprocess contains the Apply Model and Performance operators. The Performance operator is used to analyze the performance of the final decision tree. The following figures show the validation of the final decision tree.

CROSS VALIDATION PROCESS:

Cross validation is a technique used to evaluate predictive models by splitting the data into a training set and a testing set. The training set is used to train the model and the testing set is used to evaluate it.

OUTPUT:

The validation of the final decision tree yields an accuracy of 97.04% with a variation of about +/- 1.32%. The class precision and class recall values for each value of the credit_risk attribute are shown below.

VALIDATION OF FINAL LOGISTIC REGRESSION MODEL:

The final logistic regression model is validated using the Cross Validation operator, which again consists of a training subprocess and a testing subprocess. The training subprocess contains the logistic regression model, and the testing subprocess contains the Apply Model and Performance operators. The Performance operator is used to analyze the performance of the final logistic regression model. The following figures show the validation of the final logistic regression model.

CROSS VALIDATION PROCESS:

As before, cross validation splits the data into a training set used to train the model and a testing set used to evaluate it.

OUTPUT:

The validation of the final logistic regression model yields an accuracy of about 95.26% with a variation of about +/- 2.38%. The class precision and class recall values for each value of the credit_risk attribute are shown below.
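
The same 10-fold validation can be sketched with scikit-learn's cross_val_score; the figures will not reproduce the RapidMiner results exactly, since the model parameters and fold splits differ, but the structure of the check is the same.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("creditrisk.csv")
    features = ["Credit_Score", "Debt_Income_Ratio", "Late_Payments",
                "Num_Credit_Lines", "Liquid_Assets"]
    X, y = df[features], df["credit_risk"]

    models = {
        "decision tree": DecisionTreeClassifier(random_state=42),
        "logistic regression": make_pipeline(StandardScaler(),
                                             LogisticRegression(max_iter=1000)),
    }

    # 10-fold cross-validation, mirroring the Cross Validation operator:
    # report mean accuracy +/- standard deviation across folds.
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10)
        print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")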

TASK 3: TABLEAU DASHBOARD

TASK 3.1: The following screenshot shows the frequency of crash nature by remoteness and crash severity for every year from January 2001 to December 2016.

The nature of crash for every year is displayed with respect to remoteness and crash severity. To find the nature of a crash, the user simply identifies the region, then the crash severity and finally the year of the crash.

TASK 3.2:

The following figure provides the frequency of crash type over a 24-hour period for every local government location and a specific year.

The user only has to locate the local government location and then the crash year to see the frequency of each crash type over 24 hours, shown as dots identifying each crash type.

TASK 3.3:

The following figure provides an overview of the frequency of crash severity and crash type by road surface condition for a specific year and police district.

To find the frequency of crash severity and crash type, the user selects the police district, the road surface condition and the crash year, and the frequency of crash severity distributed over the various crash types is displayed.

TASK 3.4:

The following diagram presents the geographical location of each crash type and the total casualty count for a given year.

Each crash type is represented by a different color, and each point presents the crash type and total casualty count for a particular year. The user only needs to hover over a point to see the crash type, location coordinates and total casualty count for a given year.

TASK 3.5: RATIONALE FOR GRAPHIC DESIGN AND FUNCTIONALITY

Dashboard design aims at aligning the organization's efforts and helping to uncover key insights from the available data so that important decisions can be taken by leveraging those insights. The best way to start designing dashboards in Tableau is to understand the purpose of the dashboard and the audience that will view it.

After understanding the purpose, the focus should be on established best practices (Curtis, 2017). The best practices used to design the dashboards in tasks 3.1, 3.2, 3.3 and 3.4 are outlined below.

1. Designing the dashboard with the goal in mind helps to outline the important aspects and avoid unrelated information.
2. A good visual hierarchy allows the user to scan the dashboard easily.
3. The design should be aimed at performance, which can be achieved by focusing on the minimum level of detail required for a particular dashboard.
4. Do not simply import or duplicate the data from spreadsheets; careful examination and a proper understanding of the user data will remove anomalies and null values.
5. Any condensation of information must be done in a way that does not reduce its meaning.
6. Simplicity of design contributes to the elegance of the design.

In designing sub-task 3.1, which deals with finding the frequency of crash nature, the approach was to provide a worksheet that identifies each crash by the remoteness parameter and then by severity. This classifies the total data by remoteness, and each remoteness category is then broken into the five variations of crash severity. The data for each category of crash severity is then divided into the years 2001 to 2016. The X-axis therefore shows crash nature, and the Y-axis represents crash year for every crash severity and every remoteness category.

In designing sub-task 3.2, which deals with finding crash type over 24 hours, the approach was to classify crash type by location and year and then spread the crash type over the 24-hour period, so that the frequency of crash type for each local government location and specific year across the 24 hours can be determined. The Y-axis classifies the local government location and is further divided into the years 2001 to 2016, while the X-axis shows the crash type spread over the 24-hour period. To find the frequency of a crash type for a particular hour, the user selects the local government location, the year and the crash hour.

Sub-task 3.3 is designed to represent the frequency of crash severity and crash type with respect to the road surface condition. The data is spread across police districts and the years 2001 to 2016. To find the frequency of crash severity and crash type, the user looks at the police district and the years and then identifies the road surface condition. For example, the frequency of crash severity and crash type for the police district “Capricornia” where the road surface condition is sealed dry can be read off for every year.

For sub-task 3.4, the required presentation includes the crash type and total casualty count for a given year. The geo-mapping capability of Tableau is used by selecting the latitude and longitude attributes in the data, after which the crash type, total casualty count and crash year attributes are selected. The crash type is color coded so that the relevant information for each crash type, such as the year and the total casualty count for that year, can easily be found by hovering over the required crash type.

According to Stephen Few (a recognized authority on dashboard design), common pitfalls in dashboard design include supplying inadequate context for the data, displaying excessive detail, introducing meaningless variety, ineffective highlighting, poor display design, poor data arrangement and unappealing visual displays (Few, 2006). All of these pitfalls were avoided in designing the sub-tasks, which are simple to read and interpret. This approach provided an understanding of effective visual design while creating highly creative and effective dashboards. The main aim of using these Tableau dashboards is to deliver information clearly and quickly through visual design (Few, 2007).

REFERENCES:

1. Curtis, R. (2017). Tableau Deep Dive: Dashboard Design - Visual Best Practices. [online] InterWorks, Inc. Available at: https://www.interworks.com/blog/rcurtis/2017/06/20/tableau-deep-dive-dashboard-design-visual-best-practices [Accessed 11 Oct. 2017].
2. Few, S. (2006). Common pitfalls in dashboard design. [online] Perceptual Edge. Available at: https://www.perceptualedge.com/articles/Whitepapers/Common_Pitfalls.pdf [Accessed 11 Oct. 2017].
3. Few, S. (2007). Pervasive Hurdles to Effective Dashboard Design. [online] Perceptual Edge. Available at: https://www.perceptualedge.com/articles/visual_business_intelligence/pervasive_hurdles_to_dd.pdf [Accessed 11 Oct. 2017].
4. Laney, D. (2014). Six ways to measure the value of your information assets. [online] SearchCIO. Available at: http://searchcio.techtarget.com/feature/Six-ways-to-measure-the-value-of-your-information-assets [Accessed 11 Oct. 2017].
5. Levy, H. (2015). Why and How to Value Your Information as an Asset - Smarter With Gartner. [online] Smarter With Gartner. Available at: http://www.gartner.com/smarterwithgartner/why-and-how-to-value-your-information-as-an-asset/ [Accessed 11 Oct. 2017].
6. Samuel, G. (2017). Your Data: The balancing act of consumer protection and benefits of data innovation. [online] DGA. Available at: http://datagovernanceaus.com.au/dga-chairman-grame-samuel-ac-deliver-address-national-press-club-wednesday-12th-july-2017/ [Accessed 11 Oct. 2017].
7. Xero. (2017). Cloud Security | Xero. [online] Available at: https://www.xero.com/ie/about/security/ [Accessed 11 Oct. 2017].
