With the release of SQL Server 2019, Microsoft have made some exciting changes to its data virtualisation feature; PolyBase External Tables. With these enhancements, organisations can position SQL Server as the central system for querying nearly all their structured and unstructured data spread across a multitude of different vendors, databases and files, without the requirement to develop complex integrations or extraction processes.
Data Virtualisation using PolyBase External Tables
Nearly all organisations have multiple sources of data, usually scattered across various disparate systems, applications, hosting locations and legacy database systems. These data pools hold valuable insights for a business and being able to link these systems for complete analysis quickly, has always been tricky. Traditionally this would mean developing complex Extraction, Transformation and Load (ETL) processes which make centralising the data costly, time consuming and prone to error, and trying to query data in near real time in this way has never been an easy task. Linked servers can be used to allow an SQL Server to query an external data set, but this process has performance limitations, complex naming notations and if a database is migrated or is part of an Availability Group, then extra administration is required.
With SQL Server 2019, these problems are a lot less challenging. By using PolyBase External Tables, table definitions to external data sources can be created within a SQL Server database. This means to the client, the tables look like they are objects within the database itself, despite queries accessing external data stored on a different system.
The instant benefit of this is that if any client, BI tool or application can access SQL Server already, then it can also query multiple disparate data sources as well, with no changes to the application layer required. This feature was first released with SQL Server 2016, and provided access to Hadoop and Azure Blob Storage, natively in SQL Server. With SQL Server 2019, this functionality is extended even further to provide native access to Oracle, MongoDB, Teradata, unstructured data and legacy SQL Servers. In fact, if there is an ODBC driver for the data source available, then SQL Server can utilise this to create external tables, meaning pretty much any data source can be queried in real time simple using T-SQL.
Implementing PolyBase External Tables couldn’t be easier. With SQL Server 2019, this feature is now available in Standard as well as Enterprise editions. This is one of several previously Enterprise level features Microsoft has provided in the Standard edition of SQL Server 2019, reducing the cost restrictions faced by many organisations when trying to implement the latest technology.
PolyBase External Tables can natively query Big Data sources, which could mean the use of a single SQL Server would cause a performance bottleneck. For this reason, there is the option to scale PolyBase to use multiple compute nodes in a PolyBase Scale Out Group. This is useful for large intensive queries of Hadoop and Azure Blob Storage, by sharing the query workload across multiple SQL Server instances.
Once the PolyBase feature is installed and enabled, it’s just a simple case of creating an external data source, defining the schema of the external objects and then querying the data with T-SQL. These actions can all be done using T-SQL or via a dedicated GUI, available in Azure Data Studio with the Data Virtualisation extension installed.
Are there any Pitfalls?
The benefits of PolyBase External Tables to quickly and easily connect multiple sources of data to nearly any application, with very little cost, should be enough reason to consider ditching traditional ETL process. However, depending on where and what the external data source is, there may be the requirement to still extract and query data separately. Traditional ETL process usually duplicate data from a remote source to a local destination, meaning the initial load of data can be slow, but the querying of data can be tuned and manipulated for a different model than what is located at the source. With PolyBase External Tables, query performance can be affected by network latency and the underlying table structure of the remote system. This is because external tables don’t support separate indexes meaning the query patterns being executed from the central system may be completely different to the index designs on the remote system.
Summary of Key Benefits
- Quickly and easily join disparate data sources within your organisation, across on-premises or the cloud and connect existing applications or analytic tools with no need for any complex code changes.
- There is no longer the need to develop costly and time consuming ETL processes, which duplicate data and restrict the ability to query data in real time.
- PolyBase External Tables are defined at the database level, so any SQL Server upgrades, or HA/DR implementations mean the objects move with the data. Create the environment once and it is available wherever the database resides.
- Scale large query processing on Hadoop or Azure Blob Storage with a PolyBase Scale Out Group.
- Available in SQL Server 2019 Standard.
Written by David Heath, Senior Technical Consultant
An intelligent financial services firm is nimble, slick and clever. It out-performs its competitors by out-innovating them...
Successful execution of strategy requires alignment in every part of the business. Pockets of excellence are not enough. You need intelligence at every level – from the foundations up...
An intelligent business is nimble, slick and clever. It out-performs its competitors by out-innovating them. And it’s built on intelligent infrastructure from the foundations up...