Business Intelligence
Assignment-3
Pavankumar Achana (0061094561)
Toowoomba, Queensland.
0061094561 CIS8008
CONTENTS
TASK 1: RESEARCH AND CRITICAL REVIEW OF LITERATURE
TASK 1.1: INFONOMICS APPROACH
TASK 1.2: SECURITY AND PRIVACY OF “XERO LIMITED COMPANY”
TASK 2: RAPID MINER
TASK 2.1: EXPLORATORY DATA ANALYSIS
TASK 2.2: DECISION TREE MODEL
TASK 2.3: LOGISTIC REGRESSION USING WEKA EXTENSION
TASK 2.4: ACCURACY OF FINAL DECISION TREE MODEL AND FINAL LOGISTIC REGRESSION MODEL
TASK 3: TABLEAU DASHBOARD
TASK 3.1
TASK 3.2
TASK 3.3
TASK 3.4
TASK 3.5: RATIONALE FOR GRAPHIC DESIGN AND FUNCTIONALITY
REFERENCES
TASK 1: RESEARCH AND CRITICAL REVIEW OF LITERATURE
This section presents a research and critical review of infonomics, the different approaches used for data asset valuation, and the security and privacy policy statements of Xero Limited (an organization listed on the Australian stock market, ASX, under the code XRO).
TASK 1.1: INFONOMICS APPROACH
The debate about whether data is a true asset has been settled, as there is now a realization of the potential value that information holds. So, in order to effectively generate value out of information assets, infonomics practices can be applied to corporate data.
Organizations can use two approaches to identify the valuation of data: foundational methods and financial methods. The foundational methods correspond to improving the discipline of information management, while the financial methods correspond to improving the information and its economic benefits (Levy, 2015). There are six methods, spread across these two categories, through which the valuation of data can be realized for companies.
Source: www.gartner.com
These six methods analyze the data and create aggregate data quality characteristics in order to determine its intrinsic value (Laney, 2014). The six models used for the valuation of data assets are the intrinsic value of information, the business value of information, the performance value of information, the cost value of information, the market value of information and the economic value of information.
So, organizations can adopt any one of the above-mentioned methods in order to value their data assets and effectively utilize this valuation to create more revenue.
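As a rough illustration of how a foundational model turns aggregate data quality characteristics into a value score, the sketch below multiplies three quality dimensions; the dimension names and the multiplicative combination are illustrative assumptions of mine, not Gartner's published formulas.

```python
# Toy data-asset scoring sketch in the spirit of the foundational models.
# Dimension names and the product rule are illustrative assumptions.
def data_asset_score(validity, completeness, scarcity):
    # Each dimension is expressed as a fraction in [0, 1]; their product
    # gives a simple aggregate quality score for ranking data assets.
    return validity * completeness * scarcity

score = data_asset_score(validity=0.95, completeness=0.80, scarcity=0.50)
print(round(score, 2))
```

Such a score would only be used to compare data assets against each other, not to place a dollar value on them; the financial methods handle that side.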
TASK 1.2: SECURITY AND PRIVACY OF “XERO LIMITED COMPANY”
Xero Limited is a provider of online software that develops cloud-based accounting products for SMEs (small and medium-sized enterprises). It is listed on both the New Zealand and Australian stock exchanges, and it belongs to software and services according to the Global Industry Classification Standard.
The security and privacy policy statements on Xero's official website provide assurance that the privacy and security policies are up to date by defining the Terms of Use, which outline its services. The security and privacy statement addresses the confidentiality and privacy of intellectual property by clearly defining the ownership of data, data backup, and access limitations for third-party applications on clients' data. It further outlines the liabilities and the limitations associated with those liabilities (Xero, 2017).
The security provided by Xero Limited protects data by outlining access control, adopting user authentication, incorporating data encryption, implementing network protection and using secure data centers. Xero implements the international standard SOC 2 (Service Organization Control), which addresses security, availability and processing integrity along with privacy and confidentiality. The SOC 2 report provided by Xero Limited outlines the trust services provided by Xero's cloud-based accounting systems (Xero, 2017).
Further, Data Governance Australia (DGA) sets standards and benchmarks for data collection, usage, management and disclosure in order to meet the current need for an approach that is both flexible and self-regulated. It has provided nine core principles for the governance of security and privacy when handling data. These nine core principles are:
1. No Harm:
The no-harm rule refers to an organization's best efforts to ensure that no harm is caused to an individual due to the collection, use or disclosure of personal information. While handling clients' data, organizations must act with integrity and not put the data to unethical use. There should not be any exploitation of the subject individual's data collected by the organization.
I think Xero Limited complies with this rule, as its privacy policy clearly outlines the use of data for limited purposes such as verifying the individual's identity, administering the services, providing updates, marketing- or training-related services, technical support and communication. It clearly outlines the limited use of individuals' data, so there is no scope for exploitation of an individual's data.
3. Fairness:
The fairness rule corresponds to information collection from individuals for legitimate business purposes. It applies while collecting, storing and disclosing personal information with respect to the circumstances of data collection, the period of time for which the data will be stored, any legal requirements, and the harm to individuals if the data is disclosed. The organization is responsible for addressing evolving community expectations while handling data use.
Xero Limited updates its privacy policy from time to time to address the methods of data collection, the period of data storage, the conditions for disclosure to third parties and many other fairness operating principles, in order to meet the reasonable expectations of the community. Any significant changes are communicated to the individual, and the services continue only if the individual accepts the updated policy.
4. Choice:
Choice corresponds to an easily accessible mechanism through which individuals can have a choice in the collection and use of their personal information. At Xero Limited, an individual can request access to his or her personal information through the simple mechanism of sending an email request to privacy@xero.com. The request is processed if it is allowed on legal grounds, and the individual is updated with the necessary data.
5. Accuracy and Access:
Accuracy and access correspond to the accurate, non-misleading sharing of data. Further, the organization needs to provide simple methods for updating personal information in case of any errors, and organizations should encourage the development and adoption of industry standards for the effective handling of data.
Xero Limited uses industry standards for securing the data it collects in order to maintain data accuracy. These industry standards correspond to access control, user authentication, data encryption, network protection and secure data centers. An individual can update his or her data through chat, email or a voice call to Xero Limited customer care, but only after authentication.
6. Stewardship:
Organizations must ensure that trained human resources are employed for data security
and storage. Along with it, a relevant officer must be responsible for compliance with rules and
regulations and there should be an internal process which ensures this compliance.
Xero Limited employs skilled human resources to handle compliance issues. Its chief operating and financial officer, Mr. Sankar Narayan, oversees data security standards along with the various internal processes that ensure compliance with the DGA code.
7. Security:
Security, including encryption that meets industry standards to avoid any security breach, must be adopted. The collection of multiple data sets must be in de-identified form. Organizations should further ensure that third-party service providers comply with the process of securing personal information with respect to recognized industry standards.
Xero Limited employs security services that provide high performance and quality. It has disaster recovery and readiness in case of geographical adversities. Further, data encryption, network protection, security monitoring and access control are all implemented according to industry standards.
8. Accountability:
Xero Limited lists all of its management on its website, outlining each person's responsibility. Further, it publishes its privacy policy, security policy and Terms of Use in order to provide accountability. These documents outline the various processes used for collecting, storing and disclosing personal information under various circumstances.
9. Enforcement:
Compliance with the DGA code must be enforced regularly, and Xero Limited regularly amends its privacy policy and implements new policies with updates to individual users.
TASK 2: RAPID MINER
The RapidMiner tool is used to predict the credit risk of loan applicants by analyzing the data file creditrisk.csv. The analysis is done using exploratory data analysis, a decision tree model and a logistic regression model (Weka extension). The two models, decision tree and logistic regression, are then validated using the performance operator.
TASK 2.1: EXPLORATORY DATA ANALYSIS
Exploratory data analysis is performed on the data in the creditrisk.csv file, which consists of 9 variables that determine the credit_risk attribute for a particular applicant. All 9 variables are subjected to exploratory data analysis in order to determine their characteristics (minimum, maximum, average and deviation values). The file does not contain any missing values. The following figure shows the exploratory data analysis design diagram.
The attributes Debt_Income_Ratio and credit_risk are taken as a real number and text respectively, while the remaining attributes are of integer data type. The following figure shows the values corresponding to the attributes credit_risk, Applicant_ID, Credit_Score and Late_Payments.
Since the credit_risk attribute is taken as the label and is polynomial in nature, its least frequent value is “do not lend” and its most frequent value is “low”. The attribute Applicant_ID has a minimum value of 139993 and a maximum value of 980774, with a deviation of 245697.5 and an average of about 548953.3. The attribute Credit_Score has a minimum value of 397 and a maximum value of 826, with an average of about 613.592 and a deviation of about 117.780. The attribute Late_Payments has a minimum value of 0 and a maximum value of 23, with an average of about 5.024 and a deviation of 4.565.
The attribute Months_In_Job has a minimum value of 2 and maximum value of 102 with
an average of about 27.455 and deviation of about 18.935. The attribute Debt_Income_Ratio
which is a real value has a minimum value of 0.05 and the maximum value of 12.523 with an
average of about 3.794 and deviation of about 2.514. The attribute Loan_Amt has a minimum
value of 39236 and the maximum value of 449485. The average value of Loan_Amt is 189545.6
and the deviation is about 88783.746. The attribute Liquid_Assets has a minimum value of 830
and the maximum value of 24699 with an average of about 7464.182 and deviation of 5670.381.
The last attribute Num_Credit_Lines has a minimum value of 1 and a maximum value of 12 with
an average of 5.186 and deviation of 2.525.
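The same per-attribute statistics can be reproduced outside RapidMiner with pandas. The sketch below uses a tiny synthetic stand-in for creditrisk.csv (only the column names come from the report; the rows are invented for illustration).

```python
import pandas as pd

# Synthetic stand-in for creditrisk.csv -- the real file has more rows
# and columns; only the attribute names here come from the report.
df = pd.DataFrame({
    "Credit_Score": [397, 620, 826, 710],
    "Late_Payments": [0, 5, 23, 2],
    "Debt_Income_Ratio": [0.05, 3.8, 12.523, 1.9],
    "credit_risk": ["low", "moderate", "do not lend", "low"],
})

# Equivalent of RapidMiner's statistics view: min, max, mean and
# standard deviation for every numeric attribute.
stats = df.describe().loc[["min", "max", "mean", "std"]]
print(stats)

# Most and least frequent values of the polynomial credit_risk label.
counts = df["credit_risk"].value_counts()
print(counts.idxmax(), counts.idxmin())
```

On the real data this would report the same minima, maxima, averages and deviations quoted above for each attribute.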
The correlation between all the attributes except Credit_Risk (selected as the label) is provided by the correlation matrix, which is given below.
The table shows that there is a strong negative correlation between credit_score and late_payments (-0.779), credit_score and debt_income_ratio (-0.792), credit_score and num_credit_lines (-0.680), late_payments and liquid_assets (-0.622), debt_income_ratio and months_in_job (-0.516), debt_income_ratio and liquid_assets (-0.610), and liquid_assets and num_credit_lines (-0.563). Further, there is a strong positive correlation between debt_income_ratio and num_credit_lines (0.589), liquid_assets and months_in_job (0.531), num_credit_lines and late_payments (0.578), late_payments and debt_income_ratio (0.751), credit_score and months_in_job (0.619), and liquid_assets and credit_score (0.763).
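The correlation matrix is a plain Pearson correlation over the numeric attributes, which pandas computes directly. The toy data below is invented so that the signs mirror the pattern described above (credit_score moving against late_payments and debt_income_ratio).

```python
import pandas as pd

# Toy data: higher debt_income_ratio tends to come with more
# late_payments, while credit_score moves the opposite way
# (illustrative values only, not the assignment data).
df = pd.DataFrame({
    "credit_score":      [800, 720, 650, 560, 430],
    "late_payments":     [0, 2, 5, 9, 18],
    "debt_income_ratio": [0.4, 1.5, 3.2, 5.0, 9.8],
})

# Pearson correlation matrix, the same computation RapidMiner's
# Correlation Matrix operator performs on the numeric attributes.
corr = df.corr()
print(corr.round(3))
```

Values near -1 or +1 indicate the strong relationships the report highlights; values near 0 indicate attributes that vary independently.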
Performing the chi-square test, the attribute weights calculated are given below.
The top attributes influencing the credit_risk factor are credit_score, debt_income_ratio, late_payments, num_credit_lines and liquid_assets.
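The chi-square weighting measures how strongly a (binned) attribute depends on the label. A hand-rolled sketch of the statistic, approximating what a chi-square attribute-weighting operator computes, is shown below on a made-up two-row example.

```python
import pandas as pd

# Hand-rolled chi-square statistic between a binned attribute and the
# label, approximating a chi-square attribute-weighting operator.
def chi_square_weight(attr, label):
    observed = pd.crosstab(attr, label)
    # Expected counts under independence: outer product of the row and
    # column totals divided by the grand total.
    expected = (
        observed.sum(axis=1).to_frame().values
        * observed.sum(axis=0).to_frame().T.values
        / observed.values.sum()
    )
    return float(((observed - expected) ** 2 / expected).values.sum())

# Illustrative data where the band perfectly determines the risk label.
df = pd.DataFrame({
    "credit_band": ["high", "high", "low", "low"],
    "credit_risk": ["low", "low", "high", "high"],
})
w = chi_square_weight(df["credit_band"], df["credit_risk"])
print(w)
```

A larger statistic means a stronger dependence on credit_risk, which is why the five attributes above rank at the top.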
TASK 2.2: DECISION TREE MODEL
The decision tree model represents a sequence of branching operations, where each branch depends on comparisons of the values of a particular attribute. The decision tree process used for predicting the credit risk rating of a loan applicant is given below.
The branches that are color-coded red correspond to applicants who have low credit_risk. Each branch ending in a color-coded output can be analyzed by clicking on it, which shows the path leading to an ending of very low, low, moderate, high or do not lend.
Applicants with a higher credit_score (>782.5); with credit_score <=782.5 and debt_income_ratio <=0.890; or with credit_score <=782.5, debt_income_ratio >0.890 and liquid_assets >20478.5 have very low risk when the credit_risk value is calculated. So, these are the safe applicants who can be trusted with a loan.
Applicants with debt_income_ratio >10.005 are not reliable and should not be lent any money. Applicants with debt_income_ratio >5.008; with debt_income_ratio <=5.008 and credit_score <=518; with debt_income_ratio <=5.008, credit_score >518 and late_payments >8.5; or with debt_income_ratio <=5.008, credit_score >518, late_payments <=8.5 and liquid_assets >1721 have a high risk factor associated with them. So, trusting these applicants would be highly risky. All other applicants are either low or moderate with respect to the associated risk factor.
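The branching logic above can be made concrete by hand-coding a few of the reported splits as nested conditions. This is only a partial transcription of the quoted thresholds (10.005, 5.008, 782.5, 518, 8.5), not the full learned tree; the remaining branches collapse to a single default here.

```python
# Partial hand-transcription of the reported decision tree splits.
# Thresholds come from the report; the full tree has more branches
# (e.g. the very-low paths for low debt_income_ratio), which are
# collapsed into the final default return for brevity.
def tree_predict(credit_score, debt_income_ratio, late_payments):
    if debt_income_ratio > 10.005:
        return "do not lend"
    if credit_score > 782.5:
        return "very low"
    if debt_income_ratio > 5.008:
        return "high"
    if credit_score <= 518:
        return "high"
    if late_payments > 8.5:
        return "high"
    return "low"  # remaining paths are low/moderate in the full tree

print(tree_predict(credit_score=810, debt_income_ratio=0.5, late_payments=0))
```

Scoring an applicant is just a walk from the root to a leaf, one comparison per level, which is why the tree is easy to read off the RapidMiner visualization.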
TASK 2.3: LOGISTIC REGRESSION USING WEKA EXTENSION
The following figure shows the design process used to calculate the logistic regression using the Weka extension. Logistic analysis is a predictive analysis used to predict the outcome corresponding to an attribute by analyzing a given set of independent variables or attributes. The dependent variable here is credit_risk, a polynomial variable having five output values. It is used to determine the presence of risk factors corresponding to a specific outcome.
The process uses the Retrieve operator to retrieve the data from the repository and then uses Select Attributes to select the top five attributes identified in the exploratory data analysis. Next, the Set Role operator is used to identify credit_risk as the label. The data is then sent to the W-Logistic regression model (Weka extension).
The output provided by RapidMiner using the W-Logistic operator is given below.
The coefficients are the weights applied to each attribute before they are added together. The odds ratios indicate how much a change in an attribute will affect the prediction. Debt_income_ratio has a value of 462144, which indicates that the credit_risk is very low; the attribute num_credit_lines has a value of 730 and late_payments has a value of 28.4, indicating that a change in their values will have a significant impact on the credit_risk value of each loan applicant.
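The relationship between the two columns of that output is simply that an odds ratio is the exponential of its coefficient: it is the multiplicative change in the odds of the outcome for a one-unit increase in the attribute. The coefficients below are illustrative assumptions chosen so the odds ratios land near the magnitudes quoted above, not the actual W-Logistic output.

```python
import math

# Odds ratio = exp(coefficient) in logistic regression.
# Coefficients are illustrative assumptions; e.g. exp(13.04) is on the
# order of the 462144 reported for debt_income_ratio.
coefficients = {
    "debt_income_ratio": 13.04,  # exp() ~ 4.6e5
    "num_credit_lines": 6.59,    # exp() ~ 730
    "late_payments": 3.35,       # exp() ~ 28.5
}
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:,.1f}")
```

An odds ratio above 1 means increases in the attribute push the predicted odds up; below 1 would mean they push the odds down.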
TASK 2.4: ACCURACY OF FINAL DECISION TREE MODEL AND FINAL LOGISTIC REGRESSION MODEL
The validation of the final decision tree is done by incorporating the cross-validation operator, which consists of two panes known as training and testing. The training pane contains the decision tree, and the testing pane contains the Apply Model operator and the Performance operator. The Performance operator is used to analyze the performance of the final decision tree. The following figures show the validation of the final decision tree.
OUTPUT:
The output of the validation of the final decision tree shows an accuracy of 97.04% with a variation of about +/-1.32%. The class precision and class recall values for each of the values of the credit_risk attribute are provided below.
The validation of the final logistic regression is done by incorporating the cross-validation operator, which consists of two panes known as training and testing. The training pane contains the logistic regression, and the testing pane contains the Apply Model operator and the Performance operator. The Performance operator is used to analyze the performance of the final logistic regression model. The following figures show the validation of the final logistic regression.
OUTPUT:
The output of the validation of the final logistic regression shows an accuracy of about 95.26% with a variation of about +/-2.38%. The class precision and class recall values for each of the values of the credit_risk attribute are provided below.
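The "accuracy +/- variation" figures come from k-fold cross-validation: the data is split into k folds, the model is trained on k-1 folds and tested on the held-out fold, and the mean and standard deviation of the per-fold accuracies are reported. The fold accuracies below are invented so that their mean matches the 95.26% quoted for the logistic regression, purely for illustration.

```python
import statistics

# k-fold cross-validation summary: mean accuracy +/- standard deviation
# across folds. These fold accuracies are illustrative values chosen to
# average 95.26%, not the actual RapidMiner output.
fold_accuracies = [0.958, 0.947, 0.962, 0.955, 0.941]
mean_acc = statistics.mean(fold_accuracies)
std_acc = statistics.stdev(fold_accuracies)
print(f"accuracy: {mean_acc:.2%} +/- {std_acc:.2%}")
```

Comparing the two summaries this way shows the decision tree (97.04% +/- 1.32%) both scoring higher and varying less across folds than the logistic regression (95.26% +/- 2.38%).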
TASK 3: TABLEAU DASHBOARD
TASK 3.1: The following screenshot shows the frequency of crash nature according to remoteness and crash severity for every year from January 2001 to December 2016.
The nature of crash for every year with respect to remoteness and crash severity is displayed. In order to find the nature of a crash, we simply have to identify the region, then the crash severity and finally the year of the crash.
TASK 3.2:
The following figure provides the frequency of crash type over a 24-hour period for every local government location and a specific year.
All we have to do is locate the local government location and the crash year; we then get the frequency of crash type over 24 hours in the form of dots identifying each crash type.
TASK 3.3:
The following figure provides an overview of the frequency of crash severity and crash type by road surface condition for a specific year and police district.
In order to find the frequency of crash severity and crash type, we need to select the police district, road surface condition and crash year; we then get the frequency of crash severity distributed over the various crash types.
TASK 3.4:
The following diagram presents the geographical location of each crash type and the total casualty count for a given year.
Each crash type is represented by a different color, and each color presents the crash type and total casualty count for a particular year. All we need to do is hover over a point to get the crash type, location coordinates and total casualty count for a given year.
TASK 3.5: RATIONALE FOR GRAPHIC DESIGN AND FUNCTIONALITY
Dashboard design aims at aligning the organization's efforts and helps uncover key insights into the available data so that important decisions can be taken by leveraging the insights uncovered. The best way to start designing dashboards in Tableau is to understand the purpose of the dashboards and the audience that will view them.
After understanding the purpose, we need to focus on some of the best practices (Curtis, 2017). The best practices that were used to design the dashboards in tasks 3.1, 3.2, 3.3 and 3.4 are outlined below.
1. Designing the dashboard with the goal in mind will help in outlining the important aspects and avoiding unrelated information.
2. A good visual hierarchy provides the user with an easy scan.
3. The design should be aimed at performance, which can be achieved by focusing on the minimum level of detail required for a particular dashboard.
4. Do not simply import or duplicate the data in Excel sheets; a good examination with a proper understanding of the user data will remove any anomalies and null values.
5. Condensation of information, if needed, must be done in a way that does not diminish the meaning.
6. Simplicity of design contributes to the elegance of the design.
In designing sub-task 3.1, which deals with finding the frequency of crash nature, the approach was to provide a worksheet that identifies each crash by the remoteness parameter and then by severity. This provides a classification of the total data by remoteness, with each remoteness category corresponding to five variations of crash severity. After achieving this, we divide the data for each category of crash severity into the years 2001 to 2016. So we have crash nature on the X-axis, and the Y-axis represents the crash year for every crash severity and every remoteness category.
In designing sub-task 3.2, dealing with finding crash type over 24 hours, the approach was to classify crash type according to location and year, and then spread the crash type over the 24-hour duration so that the frequency of crash type for a local government location and specific year over the 24-hour period is determined. The Y-axis classifies the local government location, further divided into the years 2001 to 2016, and the X-axis provides the crash type data spread over the 24-hour period. In order to find the frequency of crash type for a particular hour, we need to select the government location, the year and the crash hour.
Sub-task 3.3 is designed to represent the frequency of crash severity and crash type with respect to the road surface condition. This data is spread across police districts and the years 2001-2016. In order to find the frequency of crash severity and crash type, we need to look at the police district and year and then identify the road surface condition. For example, the frequency of crash severity and crash type for the police district “Capricornia” where the road surface condition is sealed dry can be deduced for every year.
For sub-task 3.4, the desired presentation must include the crash type and the total casualty count for a given year. So, we use the geo-mapping capability of Tableau by selecting the latitude and longitude attributes in the data provided. After that, we select the attributes crash type, total casualty count and crash year. The crash type is color-coded so that the necessary information for each crash type, such as the year and the total casualty count for that year, is easily found by hovering over the required crash type.
According to Stephen Few (a guru of dashboard design), common pitfalls in designing Tableau dashboards are supplying inadequate context for the data, excessive display of detail, introducing meaningless variety, ineffective highlighting, poor display design, poor data arrangement, and designing unappealing visual displays (Few, 2006). All these pitfalls were avoided in designing the sub-tasks, which are simple to read and interpret. This approach has provided an understanding of effective visual design, along with creating highly creative and effective dashboards. The main aim of using these Tableau dashboards is to deliver information in a clear and quick way through visual design (Few, 2007).
REFERENCES:
1. Curtis, R. (2017). Tableau Deep Dive: Dashboard Design - Visual Best Practices. [online] InterWorks, Inc. Available at: https://www.interworks.com/blog/rcurtis/2017/06/20/tableau-deep-dive-dashboard-design-visual-best-practices [Accessed 11 Oct. 2017].
2. Few, S. (2006). Common Pitfalls in Dashboard Design. [online] Perceptual Edge. Available at: https://www.perceptualedge.com/articles/Whitepapers/Common_Pitfalls.pdf [Accessed 11 Oct. 2017].
3. Few, S. (2007). Pervasive Hurdles to Effective Dashboard Design. [online] Perceptual Edge. Available at: https://www.perceptualedge.com/articles/visual_business_intelligence/pervasive_hurdles_to_dd.pdf [Accessed 11 Oct. 2017].
4. Laney, D. (2014). Six Ways to Measure the Value of Your Information Assets. [online] SearchCIO. Available at: http://searchcio.techtarget.com/feature/Six-ways-to-measure-the-value-of-your-information-assets [Accessed 11 Oct. 2017].
5. Levy, H. (2015). Why and How to Value Your Information as an Asset. [online] Smarter With Gartner. Available at: http://www.gartner.com/smarterwithgartner/why-and-how-to-value-your-information-as-an-asset/ [Accessed 11 Oct. 2017].
6. Samuel, G. (2017). Your Data: The Balancing Act of Consumer Protection and Benefits of Data Innovation. [online] DGA. Available at: http://datagovernanceaus.com.au/dga-chairman-grame-samuel-ac-deliver-address-national-press-club-wednesday-12th-july-2017/ [Accessed 11 Oct. 2017].
7. Xero. (2017). Cloud Security | Xero. [online] Available at:
https://www.xero.com/ie/about/security/ [Accessed 11 Oct. 2017].