Data Architecture: A Primer for the Data Scientist

Ebook, 869 pages, 8 hours


About this ebook

Over the past 5 years, the concept of big data has matured, data science has grown exponentially, and data architecture has become a standard part of organizational decision-making. Throughout all this change, the basic principles that shape the architecture of data have remained the same. There remains a need for people to take a look at the "bigger picture" and to understand where their data fit into the grand scheme of things.

Data Architecture: A Primer for the Data Scientist, Second Edition addresses the larger architectural picture of how big data fits within the existing information infrastructure or data warehousing systems. This is an essential topic not only for data scientists, analysts, and managers but also for researchers and engineers who increasingly need to deal with large and complex sets of data. Until data are gathered and can be placed into an existing framework or architecture, they cannot be used to their full potential. Drawing upon years of practical experience and using numerous examples and case studies from across various industries, the authors seek to explain this larger picture into which big data fits, giving data scientists the necessary context for how pieces of the puzzle should fit together.

  • New case studies include expanded coverage of textual management and analytics
  • New chapters on visualization and big data
  • Discussion of new visualizations of the end-state architecture
Language: English
Release date: Apr 30, 2019
ISBN: 9780128169179
Author

W.H. Inmon

Best known as the “Father of Data Warehousing,” Bill Inmon has become the most prolific and well-known author worldwide in the big data analysis, data warehousing, and business intelligence arena. In addition to authoring more than 50 books and 650 articles, Bill has been a monthly columnist with the Business Intelligence Network, EIM Institute, and Data Management Review. In 2007, Bill was named by Computerworld as one of the “Ten IT People Who Mattered in the Last 40 Years” of the computer profession. With 35 years of experience in database technology and data warehouse design, he is known globally for his seminars on developing data warehouses and information architectures. Bill has been an in-demand keynote speaker for numerous computing associations, industry conferences, and trade shows. Bill Inmon also has an extensive entrepreneurial background: he founded Pine Cone Systems (later renamed Ambeo) in 1995, and founded, and took public, Prism Solutions in 1991. Bill consults with a large number of Fortune 1000 clients and leading IT executives on data warehousing, business intelligence, and database management, offering data warehouse design and database management services, as well as producing methodologies and technologies that advance the enterprise architectures of large and small organizations worldwide. He has worked for American Management Systems and Coopers & Lybrand. Bill received his Bachelor of Science degree in Mathematics from Yale University and his Master of Science degree in Computer Science from New Mexico State University.


    Book preview

    Data Architecture - W.H. Inmon

    Chapter 1.2

    The Data Infrastructure

    Abstract

    Corporate data include everything found in the corporation in the way of data. The most basic division of corporate data is into structured data and unstructured data. As a rule, there are much more unstructured data than structured data. Unstructured data have two basic divisions—repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form than repetitive big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the great divide. The divide is so large that many professionals are not even aware that it exists. As a rule, nonrepetitive big data has MUCH greater business value than repetitive big data.

    Keywords

    Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data

    If there is any secret to data management and data architecture, it is understanding data in terms of its infrastructure. Stated differently, it is almost impossible to understand the larger architecture under which data are managed and operate without understanding the underlying infrastructure that surrounds the data. Therefore, we shall spend some time understanding infrastructure.

    Two Types of Repetitive Data

    A good starting point for understanding infrastructure is the observation that repetitive data are found in two places in corporate data: in the structured side and in the unstructured big data side. Although the two sound the same, there are significant differences between them. When it comes to structured repetitive data, it is normal to have transactions as part of the repetitive data. There are sales transactions, SKU stocking transactions, inventory replenishment transactions, payment transactions, and so forth. In the structured world, many such transactions find their way into repetitive structured data.

    The other kind of repetitive data is the repetitive data found in the unstructured big data world. In the unstructured big data world, we might have metering data, analog data, manufacturing data, clickstream data, and so forth.

    There is the question then—are these types of repetitive data the same? They certainly are repetitive. But these different types of repetitive data are not the same. What is the difference then between these two types of repetitive data? Fig. 1.2.1 shows (symbolically) these two types of repetitive data.

    Fig. 1.2.1 Two types of repetitive data.

    Repetitive Structured Data

    In order to understand the differences between these two types of repetitive data, it is necessary to understand each type of data individually. Let's start with repetitive structured data. Fig. 1.2.2 shows that repetitive structured data are broken into records and blocks.

    Fig. 1.2.2 Repetitive data broken into blocks.

    The most basic unit of information in the repetitive structured environment is a block of data. Inside each block of data are records of data.

    Fig. 1.2.3 shows a simple record of data.

    Fig. 1.2.3 Records inside a block.

    Each record of data is (normally!) representative of a transaction. For example, there are records of data representing the sale of a product. Each record is representative of a single sale.

    Inside each record are keys, attributes, and indexes. Fig. 1.2.4 shows the anatomy of a record.

    Fig. 1.2.4 Attributes, keys, and indexes.

    If a record is representative of a sale, the attributes might be information about the date of the sale, the item sold, the cost of the item, any tax on the item, who bought the item, and so forth. The key of the record is one or more attributes that uniquely define the record. The key for a sale might be the date of sale, item sold, and location of the sale.

    The indexes that are attached to the record are on the attributes that are needed when there is a desire to have quick access to the record.
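
    To make the anatomy concrete, the following is a minimal sketch of a record with attributes, a key, and an index, using an in-memory SQLite database as a stand-in for a full structured DBMS. The table name, column names, and index are illustrative assumptions, not taken from the book:

```python
import sqlite3

# In-memory SQLite database standing in for a structured DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sale (
        sale_date TEXT,   -- attribute; part of the key
        item      TEXT,   -- attribute; part of the key
        location  TEXT,   -- attribute; part of the key
        cost      REAL,   -- attribute
        tax       REAL,   -- attribute
        buyer     TEXT,   -- attribute
        PRIMARY KEY (sale_date, item, location)  -- uniquely defines the record
    )
""")
# An index on an attribute that needs quick access.
conn.execute("CREATE INDEX idx_sale_buyer ON sale (buyer)")
conn.execute(
    "INSERT INTO sale VALUES ('2019-04-30', 'SKU-100', 'Denver', 9.99, 0.80, 'B. Jones')"
)
# The index lets the DBMS reach this record without scanning the table.
row = conn.execute("SELECT cost FROM sale WHERE buyer = 'B. Jones'").fetchone()
print(row[0])  # 9.99
```

    The key here is composite (date, item, and location of the sale, as in the text), while the index serves a different purpose: fast access on an attribute that is not part of the key.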

    The infrastructure that is attached to structured repetitive data managed under a DBMS is seen in Fig. 1.2.5.

    Fig. 1.2.5 A standard DBMS.

    Repetitive Big Data

    The other type of repetitive data is repetitive data found in big data. Fig. 1.2.6 depicts the repetitive data found in big data.

    Fig. 1.2.6 Repetitive big data.

    At first glance, there are just a lot of repetitive records seen in Fig. 1.2.6. But upon closer examination, it is seen that all of those repetitive big data records are packed away into a string of data and that string of data is stored inside a block of data, as seen in Fig. 1.2.7.

    Fig. 1.2.7 A block of data.

    The structured infrastructure seen in Fig. 1.2.5 is typical of an infrastructure managed under one of several DBMSs, such as Oracle, SQL Server, and DB2.

    The infrastructure for big data is quite different from the infrastructure found in a standard DBMS. In the infrastructure for big data, there is a block, and in the block are found many repetitive records. Each record is merely concatenated to the next. Fig. 1.2.8 is representative of a record that might be found in big data.

    Fig. 1.2.8 Records inside the block.

    In Fig. 1.2.8, it is seen that there is merely a long string of data, with records stacked one against the other. The system only sees the block and the long string of data. In order to find a record, the system needs to parse the string, as seen in Fig. 1.2.9.

    Fig. 1.2.9 Parsing records inside the block.

    Suppose the system wants to find record B. The system needs to sequentially read the string of data until it recognizes that a record is present. Then, the system needs to go into the record and determine whether it is record B. This is how a search is conducted, in its most primitive form, in big data.
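
    The sequential search just described can be sketched in a few lines. The record layout, an id and a payload separated by a colon with records delimited by semicolons, is an assumed format for illustration; real big data systems use their own layouts:

```python
# A big-data block: one long string of records concatenated end to end.
block = "A:first;B:second;C:third;"

def find_record(block, wanted_id):
    """Sequentially parse the block, one record at a time, until a match."""
    records_read = 0
    for raw in block.rstrip(";").split(";"):
        records_read += 1                # the system examines one more record
        rec_id, payload = raw.split(":", 1)
        if rec_id == wanted_id:          # is this the record being sought?
            return payload, records_read
    return None, records_read            # scanned the whole block, no match

payload, records_read = find_record(block, "B")
print(payload, records_read)  # second 2
```

    Note that even a miss is expensive: if the record is not in the block, every record must still be parsed before the system can say so.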

    It doesn’t take much of an imagination to see that a lot of machine cycles are chewed up looking for data in big data. To this end, the big data environment employs a means of processing referred to as the Roman census approach. More will be described about the Roman census approach in the chapter on big data.

    The Two Infrastructures

    The two different infrastructures are contrasted in Fig. 1.2.10.

    Fig. 1.2.10 Two different infrastructures.

    Without much effort, it is seen that the infrastructures surrounding big data and structured data are quite different. The infrastructure surrounding big data is quite simple and streamlined. The infrastructure surrounding structured DBMS data is elaborate and anything but streamlined.

    There is no argument, then, that significant differences exist between the infrastructure of repetitive structured data and that of repetitive big data.

    What's Being Optimized?

    When looking at the two infrastructures, it is natural to ask—what is being optimized by each infrastructure? In the case of big data, the infrastructure is optimized for the ability of the system to manage almost unlimited amounts of data. Fig. 1.2.11 shows that with the infrastructure of big data, adding new data is very easy and streamlined.

    Fig. 1.2.11 Optimal for storing massive amounts of data.
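
    As a sketch of why ingestion is so streamlined, the following appends records into fixed-size blocks; the block size and record format are assumptions for illustration. There are no keys or indexes to maintain, so adding data is a simple append:

```python
# Ingesting into a big-data store: each new record is appended to the
# current block; when a block fills, a new block is started.
BLOCK_SIZE = 32  # bytes per block (tiny, for illustration only)

def ingest(blocks, record):
    if not blocks or len(blocks[-1]) + len(record) > BLOCK_SIZE:
        blocks.append("")      # current block is full: start a new one
    blocks[-1] += record       # appending is the only work required
    return blocks

blocks = []
for i in range(6):
    ingest(blocks, f"rec{i}:payload;")
print(len(blocks))  # 3
```

    Because nothing but the block itself is updated, ingestion scales to almost unlimited amounts of data; contrast this with a DBMS insert, which must also update keys and indexes.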

    But the infrastructure behind a structured DBMS is optimized for something quite different from managing huge amounts of data. In the case of the structured DBMS environment, the optimization is on the ability to find any one given unit of data quickly and efficiently.

    Fig. 1.2.12 shows the optimization of the infrastructure of a standard structured DBMS.

    Fig. 1.2.12 Optimal for direct online access of data.

    Comparing the Two Infrastructures

    Another way to think of the different infrastructures is in terms of the amount of data and overhead required to find a given unit of data. In order to find a given unit of data, the big data environment has to search through a whole host of data. Many input/output operations (I/Os) must be performed to find a given item. To find that same item in a structured DBMS environment, only a few I/Os need to be done. So if you want to optimize for speed of access to data, the standard structured DBMS is the way to go.
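
    The I/O contrast can be sketched with a toy cost model: one "I/O" per record examined in a sequential scan, versus one probe for an index lookup. The data and the cost model are illustrative assumptions:

```python
records = [(f"key{i}", f"value{i}") for i in range(1000)]

# Big-data style: scan the block until the wanted record is found.
scan_ios = 0
found_by_scan = None
for key, value in records:
    scan_ios += 1                 # one record examined = one "I/O"
    if key == "key900":
        found_by_scan = value
        break

# Structured DBMS style: an index locates the record in a single probe
# (a hash index here; a real B-tree takes a few probes, still far fewer).
index = dict(records)
index_ios = 1
found_by_index = index["key900"]

print(scan_ios, index_ios)  # 901 1
```

    The gap widens with the data: doubling the number of records roughly doubles the scan cost, while the indexed lookup stays nearly constant.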

    On the other hand, in order to achieve that speed of access, the standard structured DBMS requires an elaborate infrastructure for data. That infrastructure must be both built and maintained over time, as data change. A considerable amount of system resources is required to build and maintain this infrastructure. But when it comes to big data, the infrastructure that must be built and maintained is minimal, and it is built and maintained very easily.

    This section began with the proposition that repetitive data can be found in both the structured and big data environment. At first glance, the repetitive data are the same or are very similar. But when you look at the infrastructure and the mechanics implied in the infrastructure, it is seen that the repetitive data in each of the environments are indeed very different.

    Chapter 1.3

    The Great Divide

    Abstract

    Corporate data include everything found in the corporation in the way of data. The most basic division of corporate data is by structured data and unstructured data. As a rule, there are much more unstructured data than structured data. Unstructured data have two basic divisions—repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form than repetitive unstructured big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the great divide. The divide is so large that many professionals are not even aware that there is this divide. As a rule, nonrepetitive big data has MUCH greater business value than repetitive big data.

    Keywords

    Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data

    Classifying Corporate Data

    Corporate data can be classified in many different ways. One of the major classifications is by structured versus unstructured data. And unstructured data can be further broken into two categories—repetitive unstructured data and nonrepetitive unstructured data. This division of data is shown in Fig.
