PolyBase creates an external table in Azure Synapse Analytics (formerly Azure SQL DW) that references data stored in Azure Blob storage or Azure Data Lake, giving structure to the raw data stored there. A PolyBase external table accesses the data in parallel, bypassing the control node and bringing the data directly into the compute nodes.
Follow these steps to implement PolyBase to load data:
Create a Master Key
The master key is required to encrypt the credential secret. It is only necessary if one does not already exist in the database.
CREATE MASTER KEY;
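To check whether a master key already exists before creating one, you can query sys.symmetric_keys; a quick sketch (the database master key is listed under the fixed name ##MS_DatabaseMasterKey##):
-- Returns a row if the database master key already exists
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';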
Create a Database Scoped Credential
The credential name cannot start with the number (#) sign. System credentials start with ##.
IDENTITY - Specifies the name of the account to be used when connecting outside the server; any string is accepted. To import a file from Azure Blob storage using a shared access signature (SAS), the identity name must be SHARED ACCESS SIGNATURE.
SECRET - Specifies the secret required for outgoing authentication. It is required to import a file from Azure Blob storage; provide your Azure storage account key.
CREATE DATABASE SCOPED CREDENTIAL credential_name
WITH
IDENTITY = 'identity_name',
SECRET = '<azure_storage_account_key>'
;
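To confirm the credential was created (the secret itself is never returned), you can query sys.database_scoped_credentials:
-- Lists database scoped credentials; only metadata is exposed
SELECT name, credential_identity, create_date
FROM sys.database_scoped_credentials;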
Create an External Data Source
TYPE - PolyBase uses Hadoop APIs to access data in Azure Blob storage. Use HADOOP when the data source is Azure Blob storage, ADLS Gen1, or ADLS Gen2.
TYPE = HADOOP
LOCATION - Provide the Azure storage account name and blob container name.
LOCATION = '<prefix>://<path[:port]>'
| External Data Source | Location Prefix | Location Path |
| --- | --- | --- |
| Azure Blob Storage | wasb[s] | <container>@<storage_account>.blob.core.windows.net |
| Azure Data Lake Store Gen 1 | adl | <storage_account>.azuredatalakestore.net |
| Azure Data Lake Store Gen 2 | abfs[s] | <container>@<storage_account>.dfs.core.windows.net |
CREDENTIAL - Provide the credential created in the previous step. A credential is only required if the storage has been secured; CREDENTIAL is not required for data sets that allow anonymous access.
CREATE EXTERNAL DATA SOURCE datasource_name
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net',
CREDENTIAL = credential_name
);
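For ADLS Gen2 the same statement uses the abfs[s] prefix and dfs endpoint from the table above; a minimal sketch, assuming a hypothetical data source name AdlsGen2Source:
-- Hypothetical ADLS Gen2 variant of the data source
CREATE EXTERNAL DATA SOURCE AdlsGen2Source
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://<container>@<storage_account>.dfs.core.windows.net',
CREDENTIAL = credential_name
);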
Create an External File Format
FORMAT_TYPE - The type of the files in Azure storage; supported file formats are DELIMITEDTEXT, RCFILE, ORC, and PARQUET.
FORMAT_OPTIONS - Specify the field terminator, string delimiter, date format, etc. for delimited text files.
DATA_COMPRESSION - Specify the DATA_COMPRESSION method if the data is compressed; PolyBase can process compressed files.
CREATE EXTERNAL FILE FORMAT fileformat_name
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
FIRST_ROW = 2,
DATE_FORMAT = 'dd/MM/yyyy',
USE_TYPE_DEFAULT = FALSE
),
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);
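Columnar formats need fewer options; a minimal sketch for Parquet with Snappy compression, assuming a hypothetical format name parquet_ff:
-- Hypothetical Parquet file format; Snappy is the usual codec for Parquet
CREATE EXTERNAL FILE FORMAT parquet_ff
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);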
Create the External Table
Specify column names and data types; these need to match the data in the source file.
LOCATION - Specify the path to the file or directory that contains the data, relative to the blob container. To point to all files under the blob container, use LOCATION = '.'
REJECT options - These control how PolyBase handles dirty records it retrieves from the external data source. A data record is considered 'dirty' if its actual data types or number of columns do not match the column definitions of the external table.
CREATE EXTERNAL TABLE DBO.EXT_TABLE (
Column_name Data_type
,Column_name1 Data_type
,Column_name2 Data_type
)
WITH (
LOCATION= 'file_name.csv',
DATA_SOURCE = datasource_name,
FILE_FORMAT = fileformat_name,
REJECT_TYPE = PERCENTAGE,
REJECT_VALUE = 10,
REJECT_SAMPLE_VALUE = 1000,
REJECTED_ROW_LOCATION = '/REJECT_Directory'
);
Now, a SELECT statement will read data from blob storage through the external objects we have created:
SELECT column_name,column_name1 ....
FROM dbo.ext_table
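To actually load the data into the dedicated SQL pool rather than just query it in place, the usual pattern is CREATE TABLE AS SELECT (CTAS); a minimal sketch, assuming a hypothetical target table dbo.stage_table:
-- Hypothetical CTAS load: materializes the external data into a distributed table
CREATE TABLE dbo.stage_table
WITH (
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM dbo.ext_table;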
Limitations:
- Only these DDL (Data Definition Language) statements are allowed on external tables:
- CREATE TABLE and DROP TABLE
- CREATE STATISTICS and DROP STATISTICS (see the sketch after this list)
- CREATE VIEW and DROP VIEW
- DML (Data Manipulation Language) operations are not supported (i.e. DELETE, INSERT, and UPDATE)
- Query limitations - PolyBase can consume a maximum of 33,000 files per folder when running 32 concurrent PolyBase queries; this count includes both files and sub-folders in each HDFS folder. If the degree of concurrency is less than 32, a user can run PolyBase queries against folders in HDFS that contain more than 33,000 files.
- Table width limitations - PolyBase in Azure Synapse has a row width limit of 1 MB, based on the maximum size of a single valid row by table definition. If the sum of the column schema is greater than 1 MB, PolyBase cannot query the data from storage.
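For example, statistics on an external table are created with the same syntax as on a regular table; a minimal sketch on the table defined above:
-- Single-column statistics on the external table; one of the few DDL operations allowed
CREATE STATISTICS stat_ext_column ON dbo.ext_table (Column_name);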