
BigQuery, Google’s Enterprise Data Warehouse

Slid02:

The BigQuery service replaces the typical hardware setup for a traditional data warehouse. That is, it
serves as a collective home for all analytical data in an organization.

Datasets are collections of tables that can be divided along business lines or a given analytical domain.
Each dataset is tied to a GCP project.

A data lake might contain files in Cloud Storage or Google Drive or transactional data in Cloud Bigtable.
BigQuery can define a schema and issue queries directly on external data as federated data sources.
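As a minimal sketch of how such a federated source might be declared, the snippet below builds a CREATE EXTERNAL TABLE statement over files in Cloud Storage. The table name `my_dataset.sales_ext` and the bucket path are hypothetical placeholders, and the helper only produces the SQL text; running it against BigQuery is left to whichever client you already use.

```python
def external_table_ddl(table: str, uris: list[str], fmt: str = "CSV") -> str:
    """Return DDL that defines a table over files kept outside BigQuery storage."""
    # Quote each URI as a SQL string literal and join them into an array.
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        f"OPTIONS (format = '{fmt}', uris = [{uri_list}])"
    )


# Hypothetical dataset, table, and bucket names, for illustration only.
ddl = external_table_ddl("my_dataset.sales_ext", ["gs://my-bucket/sales/*.csv"])
print(ddl)
```

Once a table like this exists, it can be queried with ordinary SQL even though the data stays in Cloud Storage.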

Database tables and views function the same way in BigQuery as they do in a traditional data warehouse,
allowing BigQuery to support queries written in a standard SQL dialect that is ANSI SQL:2011 compliant.

Cloud Identity and Access Management is used to grant permission to perform specific actions in
BigQuery. This replaces the SQL GRANT and REVOKE statements that are used to manage access
permissions in traditional SQL databases.
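To make the contrast with GRANT concrete, here is a small sketch that treats an IAM policy as plain data: a list of bindings, each pairing a role with its members. The helper `grant_role` and the example accounts are hypothetical; the role names `roles/bigquery.dataViewer` and `roles/bigquery.jobUser` are predefined BigQuery roles. The sketch only edits a local dictionary and does not call any Google Cloud API.

```python
import copy


def grant_role(policy: dict, role: str, member: str) -> dict:
    """Return a copy of an IAM policy with `member` added to the binding for `role`."""
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            # The role already has a binding; add the member only if it is new.
            if member not in binding["members"]:
                binding["members"].append(member)
            return updated
    # No binding for this role yet, so create one.
    bindings.append({"role": role, "members": [member]})
    return updated


# Hypothetical starting policy: one group may run query jobs.
policy = {
    "bindings": [
        {"role": "roles/bigquery.jobUser", "members": ["group:analysts@example.com"]}
    ]
}
# Roughly the counterpart of a SQL GRANT of read access.
policy = grant_role(policy, "roles/bigquery.dataViewer", "user:alice@example.com")
print(policy)
```

Revoking access would be the reverse operation: removing the member from the binding and writing the policy back.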

Slid03:

Traditional data warehouses are hard to manage and operate. They were designed for a batch paradigm
of data analytics and for operational reporting needs. The data in the data warehouse was meant to be
used by only a few management folks for reporting purposes. BigQuery is a modern data warehouse that
changes the conventional mode of data warehousing. Here we can see some of the key comparisons
between a traditional data warehouse and BigQuery.

BigQuery provides mechanisms for automated data transfer and powers applications teams already know
and use, so everyone has access to data insights. You can create read-only shared data sources that both
internal and external users can query, and make query results accessible for anyone through user-friendly
tools such as Google Sheets, Tableau, Qlik, or Google Data Studio.

BigQuery lays the foundation for AI. It’s possible to train TensorFlow and Google Cloud Machine Learning
models directly with data sets stored in BigQuery, and BigQuery ML can be used to build and train machine
learning models with simple SQL. Another extended capability is BigQuery GIS, which allows organizations
to analyze geographic data in BigQuery, essential to many critical business decisions that revolve around
location data.

BigQuery also allows organizations to analyze business events in real time, as they unfold, by automatically
ingesting data and making it immediately available to query in their data warehouse. This is supported by
the ability of BigQuery to ingest up to 100,000 rows of data per second and for petabytes of data to be
queried at lightning-fast speeds.

Due to Google’s fully managed, serverless infrastructure and globally available network, BigQuery
eliminates the work associated with provisioning and maintaining a traditional data warehousing
infrastructure.

BigQuery also simplifies data operations. Identity and Access Management controls user access to
resources through roles, groups, and permissions for running jobs and queries in a project, and the
service provides automatic data backup and replication.

Slid04:

BigQuery ML democratizes machine learning by enabling SQL practitioners to build models using existing
SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data.

BigQuery ML functionality is available by using:

● The BigQuery web UI

● The bq command-line tool

● The BigQuery REST API

● An external tool such as a Jupyter notebook or business intelligence platform

Machine learning on large data sets requires extensive programming and knowledge of ML frameworks.
These requirements restrict solution development to a very small set of people within each company, and
they exclude data analysts who understand the data but have limited machine learning knowledge and
programming expertise.

BigQuery ML empowers data analysts to use machine learning through existing SQL tools and skills.
Analysts can use BigQuery ML to build and evaluate ML models in BigQuery. Analysts no longer need to
export small amounts of data to spreadsheets or other applications, and they no longer need to wait
for limited resources from a data science team.
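The kind of SQL an analyst might write can be sketched as follows. The helpers below only assemble the statement text for training and evaluating a model; the model name, label column, and training table are hypothetical, and `logistic_reg` is just one of the model types BigQuery ML accepts.

```python
def create_model_sql(model: str, label: str, training_query: str,
                     model_type: str = "logistic_reg") -> str:
    """Return a BigQuery ML statement that trains `model` on the rows of `training_query`."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label}']) AS\n"
        f"{training_query}"
    )


def evaluate_model_sql(model: str) -> str:
    """Return a query that reports evaluation metrics for a trained model."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model}`)"


# Hypothetical dataset, model, and column names, for illustration only.
train_sql = create_model_sql(
    model="my_dataset.churn_model",
    label="churned",
    training_query="SELECT churned, tenure_months, plan FROM `my_dataset.customers`",
)
eval_sql = evaluate_model_sql("my_dataset.churn_model")
print(train_sql)
print(eval_sql)
```

Both statements can be pasted into the BigQuery web UI or submitted through the bq tool or the REST API, which is the point of the approach: the data never leaves the warehouse.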

Slid05:

BigQuery is a fully managed service, which means that the BigQuery engineering team takes care of
updates and maintenance. Upgrades shouldn't require downtime or hinder system performance.

Slid06:

Users don't need to provision resources before using BigQuery, unlike many RDBMS systems. BigQuery
allocates storage and query resources dynamically based on usage patterns.

Storage resources are allocated as users consume them and deallocated as they remove data or drop
tables.

Query resources are allocated according to query type and complexity. Each query uses a number of slots,
which are units of computation that comprise a certain amount of CPU and RAM.
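One rough way to get a feel for slot usage is to divide the total slot time a finished query consumed by its wall-clock duration. The sketch below assumes both figures are taken from the job's statistics (the REST API reports a totalSlotMs value); the sample numbers are invented for illustration.

```python
def average_slots(total_slot_ms: int, elapsed_ms: int) -> float:
    """Estimate the average number of slots a query held while it ran."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return total_slot_ms / elapsed_ms


# Invented example: 600,000 slot-milliseconds consumed over a 3-second query.
avg = average_slots(total_slot_ms=600_000, elapsed_ms=3_000)
print(avg)  # 200.0
```

This is only an average; the number of slots a query actually holds can rise and fall between its stages.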

Users don't have to make a minimum usage commitment to use BigQuery. The service allocates and
charges for resources based on their actual usage. By default, all BigQuery users have access to 2,000 slots
for query operations. They can also reserve a fixed number of slots for their project.

Slid07:

While there are situations where you can query data without loading it, for example when using public or
shared datasets, Stackdriver log files, or external data sources, for all other situations you must first load
your data into BigQuery before you can run queries. In most cases, you load data into BigQuery storage,
and if you want to get the data back out of BigQuery, you can export the data.

The gsutil tool is a Python application that lets you access Cloud Storage from the command line. You can
use gsutil to do a wide range of bucket and object management tasks, including uploading, downloading,
and deleting objects. The officially supported installation and update method for gsutil is as part of the
Google Cloud SDK.

The BigQuery command-line tool is another Python-based command-line tool, and it is also installed
through the Google Cloud SDK. The bq command-line tool serves many functions within BigQuery, but for
loading, it’s good for large data files, scheduling uploads, creating tables, defining schema, and loading
data with one command.
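The two command-line tools are often used together: gsutil copies a file into Cloud Storage, then bq loads it into a table. The sketch below only assembles and prints the two commands; the file, bucket, and table names are hypothetical, and it assumes a CSV file with one header row. Actually running the commands requires an installed and authenticated Cloud SDK.

```python
import shlex


def gsutil_cp_args(local_path: str, bucket_uri: str) -> list[str]:
    """Arguments for copying a local file into a Cloud Storage bucket."""
    return ["gsutil", "cp", local_path, bucket_uri]


def bq_load_args(table: str, source_uri: str, skip_header_rows: int = 1) -> list[str]:
    """Arguments for loading a CSV file into a table, letting BigQuery detect the schema."""
    return [
        "bq", "load",
        "--source_format=CSV",
        f"--skip_leading_rows={skip_header_rows}",
        "--autodetect",
        table,
        source_uri,
    ]


# Hypothetical file, bucket, and table names.
upload = gsutil_cp_args("sales.csv", "gs://my-bucket/sales.csv")
load = bq_load_args("my_dataset.sales", "gs://my-bucket/sales.csv")

for args in (upload, load):
    print(shlex.join(args))
    # To execute for real: subprocess.run(args, check=True)
```

Keeping the arguments as lists rather than one shell string avoids quoting problems if a path ever contains spaces.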

You can use the BigQuery web UI in the GCP Console as a visual interface to complete various tasks,
including loading and exporting data, as well as running queries.

The BigQuery API allows a wide range of services, such as Cloud Dataflow and Cloud Dataproc, to load or
extract data to and from BigQuery.
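For a sense of what a service sends when it loads data through the API, the sketch below builds the JSON body of a load job. The field names follow the load section of the REST API's job configuration; the project, dataset, table, and bucket names are hypothetical, and the choice of CSV with schema autodetection and append semantics is an assumption made for the example. Nothing is sent over the network here.

```python
import json


def load_job_body(project: str, dataset: str, table: str, source_uris: list[str]) -> dict:
    """Build the request body for a job that loads CSV files from Cloud Storage."""
    return {
        "configuration": {
            "load": {
                "sourceUris": list(source_uris),
                "sourceFormat": "CSV",
                "skipLeadingRows": 1,                # assume one header row
                "autodetect": True,                  # let BigQuery infer the schema
                "writeDisposition": "WRITE_APPEND",  # add rows to existing data
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
            }
        }
    }


# Hypothetical identifiers, for illustration only.
body = load_job_body("my-project", "my_dataset", "sales", ["gs://my-bucket/sales/*.csv"])
print(json.dumps(body, indent=2))
```

An authenticated client would submit a body like this to the jobs endpoint of the project; client libraries wrap the same structure in their own configuration objects.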

The BigQuery Data Transfer Service for Cloud Storage allows you to schedule recurring data loads from
Cloud Storage to BigQuery. It also automates data movement from a range of SaaS applications to
BigQuery on a scheduled, managed basis. The BigQuery Data Transfer Service is accessed through the GCP
Console, the BigQuery web UI, the bq command-line tool, or the BigQuery Data Transfer Service API.

Another alternative to loading data is to stream the data one record at a time. Streaming is typically used
when you need the data to be immediately available, such as for fraud detection or for monitoring system
metrics. While load jobs are free in BigQuery, there is a charge for streaming data. Therefore, it's
important to use streaming in situations where the benefits outweigh the costs.
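A streaming request can be sketched as a small JSON body in the shape used by the REST API's tabledata.insertAll method, where each row carries its data under `json` and an optional `insertId` that BigQuery uses for best-effort de-duplication. The event records below are invented, and generating a random UUID per row is an assumption; a real pipeline would more likely derive a stable ID from the record so that retries are recognized as duplicates.

```python
import uuid


def insert_all_body(records: list[dict]) -> dict:
    """Wrap one or more records in the row format expected by a streaming insert."""
    return {
        "rows": [
            {"insertId": str(uuid.uuid4()), "json": record}
            for record in records
        ]
    }


# Invented events, for illustration only.
events = [
    {"card_id": "c-001", "amount": 42.50, "country": "RO"},
    {"card_id": "c-002", "amount": 980.00, "country": "US"},
]
stream_body = insert_all_body(events)
print(stream_body)
```

Because streamed rows are billed, it can be worth checking whether a scheduled load job would meet the same latency requirement before choosing this path.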

---------------------------------------

Additional information on the gsutil tool: https://cloud.google.com/storage/docs/gsutil

Additional information on the bq command-line tool: https://cloud.google.com/bigquery/docs/bq-command-line-tool

Additional information on the BigQuery Data Transfer Service: https://cloud.google.com/bigquery/docs/transfer-service-overview

Slid08:

To take full advantage of BigQuery as an analytical engine, you should store the data in BigQuery storage.
However, your specific use case might benefit from analyzing external sources either by themselves or
JOINed with data in BigQuery storage.
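A query that combines the two kinds of storage looks like any other join. The sketch below assembles one; it assumes a native table `my_dataset.orders` and an external table `my_dataset.regions_ext` that share a `region_id` column, all of which are hypothetical names.

```python
def join_sql(native_table: str, external_table: str, key: str) -> str:
    """Return a query joining a table in BigQuery storage with an external table."""
    return (
        "SELECT n.*, e.region_name\n"
        f"FROM `{native_table}` AS n\n"
        f"JOIN `{external_table}` AS e\n"
        f"  ON n.{key} = e.{key}"
    )


# Hypothetical table and column names, for illustration only.
query = join_sql("my_dataset.orders", "my_dataset.regions_ext", "region_id")
print(query)
```

The SQL does not distinguish between the two tables; the difference shows up in performance, since the external side is read from its source on every query.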

Google Data Studio, as well as many partner tools that are already integrated with BigQuery, can be used
to draw analytics from BigQuery and build sophisticated interactive data visualizations.
