
Should we use HDFS or a cloud store for our Data Lakes when deploying in a public cloud?

Updated: Oct 1, 2022

We have found that HDFS is used primarily on premises, on dedicated servers. As we discussed earlier, decoupling storage from compute is the typical paradigm for analytics frameworks and deployments in the cloud. Relative to HDFS, cloud storage is cheaper, faster, and much easier to manage; it has a consistent security model, is interoperable, and is globally replicated and globally consistent. The cloud storage connector is preinstalled on Dataproc and EMR, and modifying existing jobs to use cloud storage just requires replacing “hdfs:” with “gs:” or “s3:” in directory path names.
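As a rough illustration of that path change, here is a minimal Python sketch that rewrites an HDFS URI to a bucket URI. The function name `to_cloud_uri` and the bucket name `my-bucket` are hypothetical, and the sketch assumes the directory layout inside the bucket mirrors the HDFS layout. The S3 scheme also depends on the connector in use: EMR accepts `s3://`, while stock Apache Hadoop uses `s3a://`.

```python
from urllib.parse import urlsplit


def to_cloud_uri(path: str, bucket: str, scheme: str = "gs") -> str:
    """Rewrite an hdfs: path to a bucket URI; return other paths unchanged.

    `bucket` and `scheme` are illustrative parameters, not part of any
    connector API. Use scheme="s3" on EMR or "s3a" on stock Hadoop.
    """
    parts = urlsplit(path)
    if parts.scheme != "hdfs":
        return path
    # The HDFS authority (namenode host and port) is dropped;
    # the bucket name takes its place in the URI.
    return f"{scheme}://{bucket}{parts.path}"


if __name__ == "__main__":
    # Hypothetical namenode address and bucket name.
    print(to_cloud_uri("hdfs://namenode:8020/data/events", "my-bucket"))
    print(to_cloud_uri("hdfs:///data/events", "my-bucket", scheme="s3"))
```

In practice the same substitution is often a find-and-replace in job configuration; the helper is only useful when paths are built programmatically.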

When deploying cloud-based analytics and data science/ML, using GCS or S3 cloud storage has large advantages over HDFS. In terms of cost, you pay only for what you use in cloud storage, and you do not need to keep over-provisioned GCP servers running 24x7 to host HDFS, which is very expensive. Databricks estimates that HDFS costs 5x as much as cloud storage in cloud deployments, and Databricks itself only supports data held in cloud storage. Separating compute from storage also allows dynamic provisioning of compute clusters: tearing down clusters when a job is done and standing up new ones on demand is easy and cost effective, since any cluster can immediately access the data in cloud storage.

Storing data in cloud storage allows seamless interoperability between applications in the Spark ecosystem, as well as with other cloud services. Windjammer’s SNE uses aggressive parallel access through the GCS C++ libraries, which delivers performance comparable to local storage and much better than HDFS. GCS and S3 are highly available and globally replicated. There is no storage to manage, and the same GCP and S3 security models can be applied to the stored data. Because of these advantages, we have focused on cloud storage and optimized for it when deploying on public clouds.


