SQL Server on Linux Journey – Part 6: Big Data Clusters
In Part 5 of this series, we talked about containers, which can be used as a type of SQL engine deployment and as a way to take advantage of Big Data Clusters.
Now we move on to Part 6, the final installment, to cover Big Data Clusters (BDC) which, according to Asad Khan, were built by keeping the latest industry scenarios front and center. Asad Khan is Director of Program Management at Microsoft and described Microsoft's journey while building BDC in a recent Microsoft Build 2020 session.
He went on to say that SQL Server needed to embrace data lakes, which are now mainstream, and differentiate its offering by closing the gap between structured and unstructured data. And by integrating the SQL Engine, Spark, and data lakes, SQL Server offers a truly unified data platform that serves OLTP workloads as well as enables customers to do analytics at scale.
What do I need to know about Big Data Clusters?
Argenis Fernandez, former Principal Program Manager with the Microsoft SQL Server team, described the history of SQL and the launch of BDC in more detail in this PASS Virtual Group session.
He said that SQL Server has had sporadic support for unstructured data sets like XML/JSON since PolyBase 1.0 enabled it to look at external data sets a few years ago, but SQL Server 2019 takes this concept to a new level. It enables data virtualization, turning SQL into a single point of entry for entire datacenters with centralized security. Customers can enable data governance and firewalling by making data sets (e.g., MongoDB, Teradata, Oracle, DB2, SAP, and IoT) available while keeping rules of access and engagement inside SQL.
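To make the idea of exposing an outside data set through SQL Server a little more concrete, here is a minimal sketch in Python that renders the T-SQL statement PolyBase uses to register an external source in SQL Server 2019. The helper name, the data source name, the host and the credential name are all hypothetical, and the statement assumes a database scoped credential with that name has already been created.

```python
def external_data_source_sql(name: str, location: str, credential: str) -> str:
    """Render a CREATE EXTERNAL DATA SOURCE statement for PolyBase.

    All arguments are illustrative; the credential is assumed to exist already.
    """
    # Only allow simple identifiers so the rendered statement stays well-formed.
    for identifier in (name, credential):
        if not identifier.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {identifier!r}")
    # Escape single quotes inside the location string literal.
    escaped_location = location.replace("'", "''")
    return (
        f"CREATE EXTERNAL DATA SOURCE [{name}]\n"
        f"WITH (LOCATION = '{escaped_location}', CREDENTIAL = [{credential}]);"
    )


if __name__ == "__main__":
    # Hypothetical MongoDB host, used only for illustration.
    print(external_data_source_sql(
        "MongoSales", "mongodb://mongo.example.internal:27017", "MongoCredential"
    ))
```

Once a source like this is registered, external tables defined on top of it can be queried with ordinary T-SQL, which is how the rules of access stay inside SQL.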
In addition to managing metadata and scaling out queries, a cool feature of the master node allows it to serve as an endpoint to connect your Business Intelligence (BI) tools or SQL Server tools such as Azure Data Studio and SQL Server Management Studio.
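As a rough illustration of what connecting a tool to that endpoint looks like, the sketch below builds an ODBC connection string in Python. The host name is hypothetical, port 31433 is assumed to be the master instance port for your deployment (check your own cluster's endpoint list), and credentials are deliberately left out.

```python
def master_instance_connection_string(host: str, port: int = 31433,
                                      database: str = "master") -> str:
    """Build an ODBC connection string for the BDC SQL Server master instance.

    The port default is an assumption; credentials must be appended by the caller.
    """
    parts = [
        "DRIVER={ODBC Driver 17 for SQL Server}",
        f"SERVER={host},{port}",  # SQL Server separates host and port with a comma
        f"DATABASE={database}",
    ]
    return ";".join(parts) + ";"


if __name__ == "__main__":
    # Hypothetical endpoint address, used only for illustration.
    print(master_instance_connection_string("bdc.example.internal"))
```

Azure Data Studio and SQL Server Management Studio take the same host and port in their connection dialogs, in the form host,port.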
In-depth view of BDC architecture
Still following Argenis's explanation, Microsoft is fully embracing Apache Spark (a data streaming and data crunching solution), so in the same Kubernetes instance you can run the SQL Server engine and the Apache Spark runtime, both talking to an underlying HDFS store (HDFS is a scalable file system that has powered Hadoop for several years). SQL Server can now natively read HDFS.
Kubernetes (to sum it up for people who are not yet familiar) is a container orchestration platform based on nodes where you can run containers. BDCs are based on multiple Kubernetes pods, each providing a single unit of isolation. A pod can access an HDFS data node, which is based on persistent storage. Keep in mind that containers don't have to deal only with ephemeral data, and this is an important concept for SQL Server. If you use persistent storage, data will remain after the lifetime of that container, and you can leverage SUSE Enterprise Storage (SES), which is powered by Ceph technology, in that scenario. Kubernetes can use several Ceph storage types, such as RBD and CephFS, and SES is one of the default CSI plugins for Kubernetes.
So now you have a complete AI platform where you can run R scripts, Python scripts, and even Java within the same memory space as SQL Server (same machine), and Apache Spark can look at all datasets. Learn more here.
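The persistent storage idea can be sketched with a generic Kubernetes PersistentVolumeClaim, built here as a Python dictionary and printed as JSON (which kubectl accepts alongside YAML). This is not how BDC provisions its own volumes; the claim name, the size and the "ceph-rbd" storage class are hypothetical and stand in for whatever Ceph-backed storage class your cluster exposes.

```python
import json


def persistent_volume_claim(name: str, storage_class: str, size_gi: int) -> dict:
    """Return a minimal PersistentVolumeClaim manifest as a dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: the volume is mounted read-write by a single node.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


if __name__ == "__main__":
    # Hypothetical names; replace with values that exist in your cluster.
    manifest = persistent_volume_claim("sql-data", "ceph-rbd", 50)
    print(json.dumps(manifest, indent=2))
```

A pod that mounts a claim like this keeps its data when the container is replaced, which is the behavior a database engine depends on.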
Microsoft & Open Source – Why SUSE?
Patrick Quairoli, Sr. Director of Technical Alliances & Certifications at SUSE, covered some important topics in the same PASS Virtual Group session, where he said that “everything is built on the shoulders of everything else.”
We started with databases -> data proliferation -> data consolidation -> data warehouses -> data analytics through abstraction, and now we are at a point where we have generated such a significant amount of data that we landed on technologies such as Big Data, Hadoop, Spark and R, which enable us to use and consume AI frameworks.
Patrick says that SUSE, as an operating system for data analytics and IoT workloads, has been providing a stable, secure and reliable platform for customers and partners for several years. We’ve been partnering with leading data platform providers for decades and have thousands of certifications including benchmarks with Microsoft SQL Server, DB2 and SAP.
As data growth has exploded and companies look for nuance in a sea of data, we continue to emerge as a platform of choice. SUSE Linux Enterprise Server is embedded in solutions like Teradata and certified for MapR, MariaDB, etc., and we collaborate with HPE and Lenovo on a number of initiatives such as MS SQL benchmarks and architectures.
Microservices bring the application to the data (as opposed to ELT), and this is one of the benefits that Linux provides to a modern data platform. A lot of the open source technologies mentioned here are native to Linux, and Microsoft made a strategic decision NOT to port them to Windows. They worked with the Linux and open source community instead, bringing the workload engine (MS SQL) to the platform and to the data, where that data resides.
So, if you think you are ready and in need of embracing microservices and an analytics solution like Apache Spark running on Hadoop, consolidating on a platform could help you save money – SQL Server 2019 running on SUSE is a great answer to that.
That’s a wrap!
Part 6 ends our series for “SQL Server on Linux Journey” and we hope you enjoyed the ride!
Read the complete series below:
- SQL on Linux Journey – Part 1: First date
- SQL on Linux Journey – Part 2: On premises success
- SQL on Linux Journey – Part 3: Azure options
- SQL on Linux Journey – Part 4: High Availability
- SQL on Linux Journey – Part 5: Containers
- SQL on Linux Journey – Part 6: Big Data Clusters (this blog)