Thursday, 25 June 2020

Using PolyBase to load data into Azure Synapse Analytics (SQL DW) from Blob Storage

Azure Synapse Analytics supports many data loading methods. The fastest and most scalable way to load data is through PolyBase. PolyBase is a technology that accesses external data stored in Azure Blob Storage, Hadoop, or Azure Data Lake via the T-SQL language.

PolyBase creates an external table in Azure Synapse Analytics (Azure SQL DW) that references the data stored in Azure Blob Storage or Azure Data Lake, giving structure to the unstructured data stored there. The external table accesses the data in parallel, bypassing the control node and bringing the data directly into the compute nodes.


Follow these steps to implement PolyBase to load data:

Create a Master Key 

The master key is required to encrypt the credential secret. It only needs to be created if one does not already exist.

       CREATE MASTER KEY;    
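If the database may already have a master key, the CREATE statement can be guarded. A minimal sketch, relying on the fixed system name of the database master key in sys.symmetric_keys:

       -- Create a master key only if the database does not already have one
       IF NOT EXISTS (SELECT 1 FROM sys.symmetric_keys
                      WHERE name = '##MS_DatabaseMasterKey##')
           CREATE MASTER KEY;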

Create a Database Scoped Credential

The credential name cannot start with the number (#) sign. System credentials start with ##.

IDENTITY - Provide any string; the identity name specifies the name of the account to be used when connecting outside the server. To import a file from Azure Blob Storage using a shared access signature, the identity name must be SHARED ACCESS SIGNATURE.

SECRET - Specifies the secret required for outgoing authentication. A secret is required to import a file from Azure Blob Storage; provide your Azure Storage account key.

        CREATE DATABASE SCOPED CREDENTIAL credential_name
        WITH
            IDENTITY = 'identity_name',
            SECRET   = '<azure_storage_account_key>';
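As an illustration, a filled-in version of the template for storage-account-key authentication might look like this (the credential and identity names are hypothetical placeholders):

        -- Hypothetical names for illustration; with storage-account-key
        -- authentication the IDENTITY string is arbitrary
        CREATE DATABASE SCOPED CREDENTIAL blob_storage_credential
        WITH
            IDENTITY = 'blob_loader',
            SECRET   = '<paste_your_storage_account_key_here>';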

Create an External Data Source 

TYPE - PolyBase uses Hadoop APIs to access data in Azure Blob Storage. Use HADOOP when the data source is Azure Blob Storage, ADLS Gen1, or ADLS Gen2.
            TYPE = HADOOP

LOCATION - Provide the Azure storage account name and blob container name.
            LOCATION = '<prefix>://<path[:port]>'
       
 External Data Source          Location Prefix   Location Path
 Azure Blob Storage            wasb[s]           <container>@<storage_account>.blob.core.windows.net
 Azure Data Lake Store Gen 1   adl               <storage_account>.azuredatalakestore.net
 Azure Data Lake Store Gen 2   abfs[s]           <container>@<storage_account>.dfs.core.windows.net

CREDENTIAL - Provide the credential created in the previous step. A credential is only required if the storage has been secured; CREDENTIAL is not required for data sets that allow anonymous access.

       CREATE EXTERNAL DATA SOURCE datasource_name
       WITH (
           TYPE = HADOOP,
           LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net',
           CREDENTIAL = credential_name
       );
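The same pattern applies to the other sources in the table above; for example, a sketch of an external data source over ADLS Gen2 using the abfss prefix (the data source name is a placeholder):

       -- Hypothetical example: the same data source pattern for ADLS Gen2
       CREATE EXTERNAL DATA SOURCE adls_gen2_source
       WITH (
           TYPE = HADOOP,
           LOCATION = 'abfss://<container>@<storage_account>.dfs.core.windows.net',
           CREDENTIAL = credential_name
       );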

Create an External File Format 

FORMAT_TYPE - The type of file format in Azure storage; supported file formats include DELIMITEDTEXT, RCFILE, ORC, and PARQUET.

FORMAT_OPTIONS - Specify the field terminator, string delimiter, date format, etc. for delimited text files.

DATA_COMPRESSION - Specify the DATA_COMPRESSION method if the data is compressed; PolyBase can process compressed files.

        CREATE EXTERNAL FILE FORMAT fileformat_name
        WITH (
            FORMAT_TYPE = DELIMITEDTEXT,
            FORMAT_OPTIONS (
                FIELD_TERMINATOR = ',',
                STRING_DELIMITER = '"',
                FIRST_ROW        = 2,
                DATE_FORMAT      = 'dd/MM/yyyy',
                USE_TYPE_DEFAULT = FALSE
            ),
            DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
        );
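For the columnar formats, the delimited-text options do not apply. A minimal sketch for Snappy-compressed Parquet files (the file format name is a placeholder):

        -- Hypothetical example: file format for Snappy-compressed Parquet files
        CREATE EXTERNAL FILE FORMAT parquet_fileformat
        WITH (
            FORMAT_TYPE = PARQUET,
            DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
        );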


Create the External Table 

Specify column names and data types; these need to match the data in the sample file.
LOCATION - Specify the path to the file or directory that contains the data, relative to the blob container. To point to all files under the blob container, use LOCATION = '.'

REJECT options - These control how PolyBase handles dirty records it retrieves from the external data source. A data record is considered 'dirty' if its actual data types or number of columns do not match the column definitions of the external table.

    CREATE EXTERNAL TABLE dbo.ext_table (
        column_name    data_type,
        column_name1   data_type,
        column_name2   data_type
    )
    WITH (
        LOCATION = 'file_name.csv',
        DATA_SOURCE = datasource_name,
        FILE_FORMAT = fileformat_name,
        REJECT_TYPE = PERCENTAGE,
        REJECT_VALUE = 10,
        REJECT_SAMPLE_VALUE = 1000,
        REJECTED_ROW_LOCATION = '/REJECT_Directory'
    );
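To actually load the data into Synapse, rather than only query it in place, the usual pattern is a CTAS (CREATE TABLE AS SELECT) statement that materializes the external table into a distributed internal table. A minimal sketch, assuming the external table above (the target table name and distribution choice are placeholders):

    -- Hypothetical load step: materialize the external data with CTAS
    CREATE TABLE dbo.stage_table
    WITH (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT * FROM dbo.ext_table;

Because each compute node reads from storage in parallel, CTAS over an external table is typically much faster than loading rows through the control node.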


Now, a SELECT statement will read data from Blob Storage through the external objects we have created:

SELECT column_name, column_name1, ....
FROM dbo.ext_table;
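External tables generally do not get statistics automatically, so creating them by hand (one of the few DDL statements allowed on external tables, per the limitations below) can help the optimizer. A sketch with placeholder names:

    -- Hypothetical example: manual statistics on the external table
    CREATE STATISTICS stats_column_name ON dbo.ext_table (column_name);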


Limitations:
  • Only these DDL (Data Definition Language) statements are allowed on external tables:
    • CREATE TABLE and DROP TABLE
    • CREATE STATISTICS and DROP STATISTICS
    • CREATE VIEW and DROP VIEW
  • DML (Data Manipulation Language) operations are not supported (i.e. DELETE, INSERT, and UPDATE)
  • Query limitations - PolyBase can consume a maximum of 33k files per folder when running 32 concurrent queries; this count includes both files and sub-folders in each HDFS folder. If the degree of concurrency is less than 32, a user can run PolyBase queries against folders in HDFS that contain more than 33k files.
  • Table width limitations - PolyBase in Azure Synapse has a row width limit of 1 MB, based on the maximum size of a single valid row in the table definition. If the sum of the column schema is greater than 1 MB, PolyBase cannot query the data from storage.
