Google Cloud Professional Data Engineer Certification Workshop

Storage Overview


You are taking this course because you want to study for Google’s Professional Data Engineer certification. Part of being a data engineer is configuring databases and creating storage buckets. But before you can start saving data to a service, you need to know which service you should be using. Google Cloud has many different storage services for different use cases. Sometimes it’s hard to understand which one to use when, and sometimes multiple services seem like they would work for the same use case.

In this chapter, you learn about some key data storage metrics: cost, availability, durability, and consistency. Let’s get started by talking about some storage basics.

Knowing those metrics can help you choose from the many GCP storage services. You will get an overview of those in this chapter.

You will also learn how to architect data-oriented solutions by combining services.

There are many different types and sources of data in the world. Pause the video for a couple of minutes and list some different types of data you need to be able to store.

You may have come up with answers like: binary files like videos and images, text files, structured data like JSON or XML, relational database data, computer programs, machine images and backups, source code, and many other sources.

Pause the video for a couple minutes again and list some things other than data type that you should consider when choosing a storage option.

Your answers might include: the amount of data (storing a MB is a different problem than storing a TB). How many users need to access the data? Is the data public or does it need to be secured? What will the data be used for? Do changes in the data require transactions? Obviously, there are many other considerations you might need to understand.

Choosing the right storage solution depends on many things: what you are storing (videos, emails, code, images, and so on), the volume of data, the security requirements, how the data will be used after it is stored, and who your users are.

Storage cost is a key factor when choosing the right storage solution.

And the costs vary greatly. Knowing the requirements of your data will help you make a more informed decision when choosing a storage solution.

As examples, Spanner and Bigtable are two of the most expensive storage options on GCP. One of the key characteristics of Spanner is you can get five 9s availability. If that is required by your use case, it may be well worth paying for. But for most data storage needs that would be overkill. Similarly, Bigtable scales to massive data sets and can ingest vast amounts of data very quickly. But it would be outrageously expensive to store a GB of data in Bigtable, especially when Firestore would be free for that amount.

Availability is a measure of the percentage of time data can be accessed. Storing data on a thumb drive in your desk drawer has extremely low availability: you have to hunt for the drive and insert it into your computer to get access to the files. Storage solutions in the cloud can have extremely high availability. These services are always on, and the data can be stored in multiple zones within a region or even in multiple regions. As an example, when using Spanner, you can have a single-region or multi-region deployment. In either case, the data is replicated across multiple zones, but if you use one region, you get a 99.99% availability SLA. If you deploy across multiple regions, you get a 99.999% SLA.
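To make those SLA numbers concrete, here is a quick illustrative calculation of the maximum downtime each SLA allows per year (the SLA percentages come from the examples above; the helper function is just for this sketch):

```python
# Downtime allowed per year under a given availability SLA.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def max_downtime_minutes(availability: float) -> float:
    """Maximum minutes of downtime per year allowed by an availability SLA."""
    return MINUTES_PER_YEAR * (1 - availability)

# 99.99% (four 9s) -- the single-region example
print(round(max_downtime_minutes(0.9999), 1))   # about 52.6 minutes per year
# 99.999% (five 9s) -- the multi-region example
print(round(max_downtime_minutes(0.99999), 2))  # about 5.26 minutes per year
```

So adding one more 9 cuts the allowed downtime by a factor of ten, which is why five 9s commands a premium price.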

Durability is a measure of how likely data is to survive a hardware failure. Disks fail regularly, so durability is achieved by writing the data to multiple disks at the same time. Google Cloud Storage is designed for eleven 9s (99.999999999%) annual durability. So, it’s highly unlikely that you would lose data because of a hardware failure.
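To get a feel for what eleven 9s means, here is a back-of-the-envelope sketch of expected object loss per year. This is a simplified model (it treats each object's loss as independent with the designed durability as its survival probability), not a guarantee from the service:

```python
# Rough expected object loss per year for a given designed durability.
def expected_losses_per_year(num_objects: int, durability: float) -> float:
    """Expected number of objects lost per year, assuming independent losses."""
    return num_objects * (1 - durability)

eleven_nines = 0.99999999999  # Cloud Storage's designed annual durability

# Even storing one million objects, the expected loss is on the order of
# 0.00001 objects per year -- effectively zero.
print(expected_losses_per_year(1_000_000, eleven_nines))
```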

Pause the video again and fill in the table. Based on the scenario at the left, list some real-world use cases.

Delivering web content online might be an example where high availability is very important to you.

Backups, disaster recovery scenarios, and data archives might be examples where a low-cost, durable storage service matters most.

Caching might be an example of when cost is more important than both availability and durability. Memcache, for example, is free, but the cached data might go away and have to be recreated. 

Different data services also provide different data consistency characteristics.

Some data services offer transactional consistency. That means when one or more operations alter data, all of those operations must succeed, or they all must fail and the data must be left in the state it was in before the transaction started. In GCP, relational database services like Spanner and Cloud SQL allow transactions, as does Firestore.
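The services differ, but the all-or-nothing behavior is the same idea you get from any transactional database. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a service like Cloud SQL (the accounts table and the transfer are made up for illustration):

```python
import sqlite3

# A stand-in for a transactional database: two accounts, one money transfer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # Simulate a failure after the debit but before the credit:
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The partial update was rolled back, so both balances are unchanged.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```

Without the transaction, the debit would have been applied and the credit lost, leaving the data in an inconsistent state.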

Eventual consistency is common in distributed systems where there are multiple copies of the data. If data is changed, it is initially changed in one of the copies then in the background, all the other copies are updated with that change. With eventual consistency, it is possible for different nodes in the distributed system to return different results. Bigtable is eventually consistent.
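A toy model can make this concrete. In the sketch below (all names are made up for illustration), a write lands on one replica first and propagates to the others later, so reads against different replicas can briefly disagree:

```python
# Toy model of eventual consistency: three copies of the same record.
replicas = [{"key": "v1"}, {"key": "v1"}, {"key": "v1"}]

def write(value):
    replicas[0]["key"] = value          # the change lands on one copy first...

def replicate():
    for r in replicas[1:]:              # ...and reaches the others later
        r["key"] = replicas[0]["key"]

write("v2")
# Before replication finishes, different replicas return different results:
print([r["key"] for r in replicas])    # ['v2', 'v1', 'v1']

replicate()
# Eventually, all replicas converge on the new value:
print([r["key"] for r in replicas])    # ['v2', 'v2', 'v2']
```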

Google Cloud Storage is a distributed data storage service that is strongly consistent. If you overwrite a file in Cloud Storage, the operation is not considered complete until all the copies of the file are updated. So, you would never have a case where two people request the same file at the same time and get different versions of the file.

Data warehouses combine data from many different data sources. Data sources might include relational databases, log files, web analytics data, and many others.

Once aggregated, a data warehouse allows analytics to be done by combining data from these various sources. So, you might be able to combine data about your website traffic with sales from your database to analyze how website usage impacts sales trends.
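Here is a sketch of the kind of cross-source query a warehouse enables, using Python's sqlite3 module as a stand-in for a warehouse like BigQuery. The table and column names are invented for this example:

```python
import sqlite3

# Two "sources" loaded into one warehouse: web analytics and sales data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE web_traffic (day TEXT, page_views INTEGER)")
db.execute("CREATE TABLE sales (day TEXT, revenue INTEGER)")
db.executemany("INSERT INTO web_traffic VALUES (?, ?)",
               [("2024-01-01", 1000), ("2024-01-02", 5000)])
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("2024-01-01", 20), ("2024-01-02", 90)])

# Combine the sources: revenue per thousand page views, by day.
rows = db.execute("""
    SELECT t.day, s.revenue * 1000.0 / t.page_views
    FROM web_traffic t JOIN sales s ON s.day = t.day
    ORDER BY t.day
""").fetchall()
print(rows)  # [('2024-01-01', 20.0), ('2024-01-02', 18.0)]
```

Neither source alone could answer this question; the join across sources is exactly what the warehouse is for.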

Data warehouses are also a good way of maintaining a historical archive of data.