Thursday, 25 June 2020

Using Polybase to load data into Azure Synapse Analytics (SQL DW) from Blob Storage

Azure Synapse Analytics supports many data loading methods. The fastest and most scalable way to load is through PolyBase. PolyBase is a technology that accesses external data stored in Azure Blob storage, Hadoop or Azure Data Lake Store via the T-SQL language.

PolyBase creates an external table in Azure Synapse Analytics (Azure DW) that references the data stored in Azure Blob storage or in Azure Data Lake Store, giving structure to the unstructured data stored there. The external table accesses the data in parallel, bypassing the control node and bringing the data directly into the compute nodes. 


Follow these steps to implement PolyBase to load data:

Create a Master Key 

The master key is required to encrypt the credential secret in the next step. It is only necessary if one does not already exist in the database.

       CREATE MASTER KEY;    

Create a Database Scope Credential 

The credential name cannot start with the number (#) sign. System credentials start with ##.

IDENTITY - Provide any string; the identity name specifies the name of the account to be used when connecting outside the server. To import a file from Azure Blob storage using a shared access signature, the identity name must be SHARED ACCESS SIGNATURE.

SECRET - Specifies the secret required for outgoing authentication. A secret is required to import a file from secured Azure Blob storage; provide your Azure storage account key. 

        CREATE DATABASE SCOPED CREDENTIAL credential_name 
        WITH
             IDENTITY = 'identity_name',
             SECRET   = '<azure_storage_account_key>';

Create an External Data Source 

TYPE - PolyBase uses Hadoop APIs to access data in Azure Blob storage. Use HADOOP when the data source is Azure Blob storage, ADLS Gen1 or ADLS Gen2.  
            TYPE = HADOOP

LOCATION - Provide Azure storage account name and blob container name 
            LOCATION = '<prefix>://<path[:port]>'
       
 External Data Source          Location Prefix   Location Path 
 Azure Blob Storage            wasb[s]           <container>@<storage_account>.blob.core.windows.net
 Azure Data Lake Store Gen 1   adl               <storage_account>.azuredatalakestore.net
 Azure Data Lake Store Gen 2   abfs[s]           <container>@<storage_account>.dfs.core.windows.net

CREDENTIAL - Provide the credential created in the previous step. A credential is only required if the storage has been secured; CREDENTIAL is not required for data sets that allow anonymous access. 

       CREATE EXTERNAL DATA SOURCE datasource_name
       WITH (
           TYPE = HADOOP,
           LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net',
           CREDENTIAL = credential_name
       );  

Create an External File Format 

FORMAT_TYPE - Type of file format in Azure storage; supported file formats are DELIMITEDTEXT, RCFILE, ORC and PARQUET.

FORMAT_OPTIONS - Specify the field terminator, string delimiter, date format etc. for delimited text files. 

DATA_COMPRESSION - Specify the DATA_COMPRESSION method if the data is compressed. PolyBase can process compressed files. 

        CREATE EXTERNAL FILE FORMAT fileformat_name
        WITH (
            FORMAT_TYPE = DELIMITEDTEXT,
            FORMAT_OPTIONS (
                FIELD_TERMINATOR = ',',
                STRING_DELIMITER = '"',
                FIRST_ROW        = 2,
                DATE_FORMAT      = 'dd/MM/yyyy',
                USE_TYPE_DEFAULT = FALSE
            ),
            DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
        );


Create the External Table 

Specify column names and data types; these must match the data in the source file.
LOCATION - Specify the path to the file or directory that contains the data (relative to the blob container). To point to all files under the blob container, use LOCATION = '.'

REJECT options - These control how PolyBase handles dirty records it retrieves from the external data source. A data record is considered 'dirty' if its actual data types or the number of columns do not match the column definitions of the external table. 

    CREATE EXTERNAL TABLE dbo.ext_table (
                column_name         data_type
                ,column_name1      data_type
                ,column_name2      data_type        
            )
    WITH (
        LOCATION = 'file_name.csv',    
        DATA_SOURCE = datasource_name,
        FILE_FORMAT = fileformat_name,
        REJECT_TYPE = PERCENTAGE,
        REJECT_VALUE = 10,
        REJECT_SAMPLE_VALUE = 1000,
        REJECTED_ROW_LOCATION = '/REJECT_Directory'
    );


Now, a SELECT statement will read the data from Blob storage using the external objects we have created:

SELECT column_name, column_name1, ....
FROM dbo.ext_table;
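Querying the external table reads directly from storage each time. To actually load the data into a physical Synapse table, a common pattern is CREATE TABLE AS SELECT (CTAS) over the external table. A minimal sketch, assuming the dbo.ext_table defined above; the target table name, column names and distribution choice are illustrative placeholders:

```sql
-- Load the external data into a physical, distributed Synapse table (CTAS).
-- dbo.stage_table and the column names are illustrative placeholders.
CREATE TABLE dbo.stage_table
WITH (
    DISTRIBUTION = ROUND_ROBIN,    -- or HASH(column_name) for large fact tables
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT column_name, column_name1, column_name2
FROM dbo.ext_table;

-- Create statistics so the optimizer knows about the loaded data
CREATE STATISTICS stat_column_name ON dbo.stage_table (column_name);
```

ROUND_ROBIN is a safe default for staging tables; a HASH distribution on a frequently joined column usually performs better for large fact tables.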


Limitations : 
  • Only these DDL (Data Definition Language) statements are allowed on external tables: 
    • CREATE TABLE  and DROP TABLE
    • CREATE STATISTICS and DROP STATISTICS
    • CREATE VIEW and DROP VIEW
  • DML (Data Manipulation Language) operations are not supported (i.e. DELETE, INSERT and UPDATE)
  • Query limitations - PolyBase can consume a maximum of 33,000 files per folder when running 32 concurrent queries; this count includes both files and sub-folders in each HDFS folder. If the degree of concurrency is less than 32, a user can run PolyBase queries against folders in HDFS that contain more than 33,000 files. 
  • Table width limitations - PolyBase in Azure Synapse has a row width limit of 1 MB, based on the maximum size of a single valid row by table definition. If the sum of the column schema is greater than 1 MB, PolyBase cannot query the data from storage. 
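To illustrate the first two limitations: metadata-level DDL succeeds against an external table, while DML fails. A short sketch, assuming the dbo.ext_table created earlier:

```sql
-- Allowed DDL on an external table:
CREATE STATISTICS stat_ext_column ON dbo.ext_table (column_name);
-- (CREATE VIEW / DROP VIEW over the external table are also allowed,
--  each in its own batch)

-- Not allowed - DML against an external table fails:
-- INSERT INTO dbo.ext_table (column_name) VALUES ('x');
-- UPDATE dbo.ext_table SET column_name = 'x';
-- DELETE FROM dbo.ext_table;
```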









Tuesday, 23 June 2020

SSAS Multidimensional and Tabular Models

SQL Server Analysis Services provides two approaches for creating a Business Intelligence semantic model: Tabular and Multidimensional. The amount and type of data you need to import can be a primary consideration when deciding which model type best fits your data. Both Tabular and Multidimensional models use imported data from external sources. The SSAS Tabular model uses in-memory technology introduced by Microsoft with SQL Server 2012.  

Multidimensional Model :

  • The multidimensional model uses OLAP modelling standards such as cubes, dimensions and facts 
  • It uses MOLAP, ROLAP or HOLAP storage modes 
  • Multidimensional model databases can use dimension- and cell-level security, using role-based permissions 
  • In multidimensional models, all logic and calculations use MDX (Multidimensional Expressions)
  • Multidimensional models are useful for very large datasets (in terabytes), many-to-many relationships, complex modelling requirements and write-back support 
  • Multidimensional models require plenty of high-speed disk because they store large amounts of data 

Tabular Model :

  • Tabular models are in-memory databases that use relational modelling standards such as tables and relationships 
  • The tabular model uses the xVelocity (VertiPaq) in-memory engine, which is much faster than the multidimensional model
  • Tabular model databases can use row-level security, using role-based permissions
  • In tabular models, business logic is created in DAX (Data Analysis Expressions), which is much easier than MDX and offers Excel-like functions designed for business requirements
  • Only tabular models are supported in Azure Analysis Services
  • The tabular model is not for systems that are going to load terabytes of data 
  • Since the tabular model uses in-memory technology, it needs plenty of RAM because the entire database sits in memory. It has more efficient data compression compared to multidimensional models 
The following table summarizes feature availability at the model level:
 Feature              Multidimensional                      Tabular 
 Modelling            Multidimensional (dimensional)        Relational 
 Storage              MOLAP, ROLAP, HOLAP                   In-memory and DirectQuery 
 Query language       MDX (Multidimensional Expressions)    DAX (Data Analysis Expressions) 
 Structure            Rigid                                 Loose 
 Complexity           Complex                               Simple 
 Size                 Larger                                Smaller 
 Aggregations         Yes                                   No 
 Calculated columns   No                                    Yes 
 Hierarchies          Yes                                   Yes 
 KPIs                 Yes                                   Yes 
 Partitions           Yes                                   Yes 
 Perspectives         Yes                                   Yes 

