Thursday, 25 June 2020

Using Polybase to load data into Azure Synapse Analytics (SQL DW) from Blob Storage

Azure Synapse Analytics supports many data loading methods. The fastest and most scalable way to load data is through Polybase. Polybase is a technology that accesses external data stored in Azure Blob storage, Hadoop or Azure Data Lake via the T-SQL language.

Polybase creates an external table in Azure Synapse Analytics (Azure SQL DW) that references the data stored in Azure Blob storage or in Azure Data Lake, giving structure to the unstructured data stored there. The external table accesses the data directly and in parallel, bypassing the control node and bringing the data straight into the compute nodes.


Follow these steps to implement Polybase to load data :

Create a Master Key 

The master key is required to encrypt the credential secret. This step is only necessary if a master key does not already exist.

       CREATE MASTER KEY;    

Create a Database Scope Credential 

The credential name cannot start with the number (#) sign. System credentials start with ##.

IDENTITY - Provide any string; the identity name specifies the name of the account to be used when connecting outside the server. To import a file from Azure Blob storage using a shared access signature, the identity name must be SHARED ACCESS SIGNATURE.

SECRET - Specifies the secret required for outgoing authentication. A secret is required to import a file from Azure Blob storage; provide your Azure storage account key.

        CREATE DATABASE SCOPED CREDENTIAL credential_name
        WITH
            IDENTITY = 'identity_name',
            SECRET   = '<Azure Storage Account Key>';

Create an External Data Source 

TYPE - Polybase uses HADOOP APIs to access data in Azure Blob storage. Use HADOOP when the data source is Azure Blob storage, ADLS Gen1 or ADLS Gen2.
            TYPE = HADOOP

LOCATION - Provide Azure storage account name and blob container name 
            LOCATION = '<prefix>://<path[:port]>'
       
 External Data Source           Location Prefix   Location Path
 Azure Blob Storage             wasb[s]           <container>@<storage_account>.blob.core.windows.net
 Azure Data Lake Store Gen 1    adl               <storage_account>.azuredatalakestore.net
 Azure Data Lake Store Gen 2    abfs[s]           <container>@<storage_account>.dfs.core.windows.net

CREDENTIAL - Provide the credential created in the previous step. A credential is only required if the blob storage has been secured; CREDENTIAL is not required for data sets that allow anonymous access.

       CREATE EXTERNAL DATA SOURCE datasource_name
       WITH (
           TYPE = HADOOP,
           LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net',
           CREDENTIAL = credential_name
       );

Create an External File Format 

FORMAT_TYPE - Type of the file format in Azure storage; supported file formats are DELIMITEDTEXT, RCFILE, ORC, PARQUET, etc.

FORMAT_OPTIONS - Specify the field terminator, string delimiter, date format, etc. for delimited text files.

DATA_COMPRESSION - Specify the DATA_COMPRESSION method if the data is compressed. Polybase can process compressed files.

        CREATE EXTERNAL FILE FORMAT fileformat_name
        WITH (
            FORMAT_TYPE = DELIMITEDTEXT,
            FORMAT_OPTIONS (
                FIELD_TERMINATOR = ',',
                STRING_DELIMITER = '"',
                FIRST_ROW        = 2,
                DATE_FORMAT      = 'dd/MM/yyyy',
                USE_TYPE_DEFAULT = FALSE
            ),
            DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
        );


Create the External Table 

Specify column names and data types; these need to match the data in the sample file.
LOCATION - Specify the path to the file or directory that contains the data (relative to the blob container). To point to all files under the blob container, use LOCATION = '.'

REJECT options - These specify how Polybase handles dirty records it retrieves from the external data source. A data record is considered 'dirty' if its actual data types or the number of columns do not match the column definitions of the external table.

    CREATE EXTERNAL TABLE dbo.ext_table (
        Column_name    Data_type,
        Column_name1   Data_type,
        Column_name2   Data_type
    )
    WITH (
        LOCATION = 'file_name.csv',
        DATA_SOURCE = datasource_name,
        FILE_FORMAT = fileformat_name,
        REJECT_TYPE = PERCENTAGE,
        REJECT_VALUE = 10,
        REJECT_SAMPLE_VALUE = 1000,
        REJECTED_ROW_LOCATION = '/REJECT_Directory'
    );


Now, a SELECT statement will read data from blob storage using the external objects we have created:

SELECT column_name, column_name1, ....
FROM dbo.ext_table;
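
To actually load the data into the dedicated SQL pool, the external table is usually combined with a CREATE TABLE AS SELECT (CTAS) statement. A minimal sketch, assuming the external table above (the target table name and distribution option are illustrative):

    CREATE TABLE dbo.stage_table
    WITH (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT column_name, column_name1, column_name2
    FROM dbo.ext_table;   -- reads from blob storage in parallel via Polybase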


Limitations : 
  • Only these DDL (Data Definition Language) statements are allowed on external tables: 
    • CREATE TABLE  and DROP TABLE
    • CREATE STATISTICS and DROP STATISTICS
    • CREATE VIEW and DROP VIEW
  • DML (Data Manipulation Language) operations are not supported (i.e. DELETE, INSERT and UPDATE)
  • Query limitations - Polybase can consume a maximum of 33k files per folder when running 32 concurrent queries; this count includes both files and sub-folders in each HDFS folder. If the degree of concurrency is less than 32, a user can run Polybase queries against folders in HDFS that contain more than 33k files.
  • Table width limitations - Polybase in Azure Synapse has a row width limit of 1 MB, based on the maximum size of a single valid row by table definition. If the sum of the column schema is greater than 1 MB, Polybase cannot query the data from storage.









Tuesday, 23 June 2020

SSAS Multidimensional and Tabular Models

SQL Server Analysis Services provides two approaches for creating a Business Intelligence semantic model: Tabular and Multidimensional. The amount and type of data you need to import can be a primary consideration when deciding which model type best fits your data. Both Tabular and Multidimensional models use data imported from external sources. The SSAS Tabular model uses newer in-memory technology, introduced by Microsoft with SQL Server 2012.

Multidimensional Model :

  • The Multidimensional model uses OLAP modelling standards such as cubes, dimensions and facts
  • It uses MOLAP, ROLAP or HOLAP storage for modelling
  • Multidimensional model databases can use dimension- and cell-level security, using role-based permissions
  • In Multidimensional models, all logic and calculations use MDX (Multidimensional Expressions)
  • Multidimensional models are useful for large datasets (in terabytes), many-to-many relationships, complex modelling requirements and write-back support
  • Multidimensional models require plenty of high-speed disks because they store large amounts of data

Tabular Model :

  • Tabular models are in-memory databases that use relational modelling standards such as tables and relationships
  • The Tabular model uses the xVelocity (VertiPaq) engine, which is much faster than the Multidimensional model
  • Tabular model databases can use row-level security, using role-based permissions
  • In Tabular models, business logic is created in DAX (Data Analysis Expressions), which is much easier than MDX and uses Excel-like functions designed for business requirements
  • Only Tabular models are supported in Azure Analysis Services
  • The Tabular model is not for systems that are going to load terabytes of data
  • Since the Tabular model uses in-memory technology, it needs plenty of RAM as the entire database sits in memory. It has more efficient data compression compared to Multidimensional models
The following table summarizes feature availability at the model level:

 Feature             Multidimensional                             Tabular
 Modelling           Multi-dimensional (dimensional modelling)    Relational modelling
 Storage             MOLAP, ROLAP, HOLAP                          In-memory and DirectQuery
 Query language      MDX (Multidimensional Expressions)           DAX (Data Analysis Expressions)
 Structure           Rigid structure                              Loose structure
 Complexity          Complex                                      Simple
 Size                Larger                                       Smaller
 Aggregations        Yes                                          No
 Calculated columns  No                                           Yes
 Hierarchies         Yes                                          Yes
 KPIs                Yes                                          Yes
 Partitions          Yes                                          Yes
 Perspectives        Yes                                          Yes


Saturday, 20 April 2019

SSIS : Execute SSIS Packages with T-SQL using SSISDB Catalog

There are several ways to execute SSIS packages that have been deployed to the SSISDB catalog. One method is to use T-SQL statements or SQL stored procedures; it is the most flexible way to execute catalog-deployed packages. There are multiple scenarios where it is very handy to run an SSIS package from T-SQL queries or SQL stored procedures.


The following steps will cover the requirement :

1. Connect to SSMS, open the SSIS catalog database and locate the required SSIS package:



2. Right-click the SSIS package and click Execute:

3. Now select the parameter(s) used in the SSIS package: click the three dots (...) next to each parameter and enter the required value:



4. After entering the parameter values, click Script (top left) to generate the following script:


The above code can be copied from here:

USE [SSISDB] 

--This creates an execution id for the package run
DECLARE @execution_id BIGINT 
EXEC [SSISDB].[catalog].[Create_execution] 
  @package_name=N'Test_Pkg.dtsx', 
  @execution_id=@execution_id output, 
  @folder_name=N'Test', 
  @project_name=N'SSIS_Package_Run_Using_T-sql', 
  @use32bitruntime=false, 
  @reference_id=NULL 

SELECT @execution_id 

--This sets the parameter values used by the SSIS package
DECLARE @var0 INT = 0 
EXEC [SSISDB].[catalog].[Set_execution_parameter_value] 
  @execution_id, 
  @object_type=20, 
  @parameter_name=N'Project_Param', 
  @parameter_value=@var0 

DECLARE @var1 SMALLINT = 1
EXEC [SSISDB].[catalog].[Set_execution_parameter_value] 
  @execution_id, 
  @object_type=50, 
  @parameter_name=N'LOGGING_LEVEL', 
  @parameter_value=@var1 

--Finally, the stored procedure that starts the SSIS package execution
EXEC [SSISDB].[catalog].[Start_execution] 
  @execution_id 

GO
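
The same calls can be wrapped in a stored procedure so the package can be run on demand with a single EXEC. Below is a minimal sketch reusing the folder, project and parameter names above; the procedure name is illustrative, and the optional SYNCHRONIZED system parameter is added so the call waits for the package to finish:

CREATE PROCEDURE dbo.usp_Run_Test_Pkg
    @Project_Param INT = 0
AS
BEGIN
    DECLARE @execution_id BIGINT;

    --Create the execution instance for the package
    EXEC [SSISDB].[catalog].[create_execution]
        @package_name = N'Test_Pkg.dtsx',
        @execution_id = @execution_id OUTPUT,
        @folder_name  = N'Test',
        @project_name = N'SSIS_Package_Run_Using_T-sql',
        @use32bitruntime = FALSE,
        @reference_id = NULL;

    --Pass the project parameter value
    EXEC [SSISDB].[catalog].[set_execution_parameter_value]
        @execution_id,
        @object_type = 20,
        @parameter_name = N'Project_Param',
        @parameter_value = @Project_Param;

    --SYNCHRONIZED = 1 makes the call wait until the package completes
    DECLARE @sync SMALLINT = 1;
    EXEC [SSISDB].[catalog].[set_execution_parameter_value]
        @execution_id,
        @object_type = 50,
        @parameter_name = N'SYNCHRONIZED',
        @parameter_value = @sync;

    --Start the execution
    EXEC [SSISDB].[catalog].[start_execution] @execution_id;
END
GO

It can then be called with, for example, EXEC dbo.usp_Run_Test_Pkg @Project_Param = 0;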

Conclusion : 

In this post, I have demonstrated how to run a catalog-deployed SSIS package using T-SQL statements. We can develop a stored procedure, as sketched above, to fulfil requirements and run SSIS packages using T-SQL.

Tuesday, 31 December 2013

SQL SERVER : Find the Size of Database file and Log file

Data and log information are never mixed on the same file, and individual files are used only by one database.

The database files (primary / secondary) are used for storing data. Each database also has at least one log file associated with it; this is the transaction log file used to store the user transactions.

Please find the below code and screenshot for finding the size of the database and log files:

Code :

SELECT DB_NAME(database_id) AS Database_Name, name AS Logical_Name,
physical_name, (size * 8) / 1024 AS SizeMB
FROM sys.master_files
WHERE DB_NAME(database_id) = '<databasename>'


Screenshot :
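
A small variation (a sketch using the type_desc column of sys.master_files) separates data files from log files for every database:

SELECT DB_NAME(database_id) AS Database_Name,
       type_desc,                              -- ROWS = data file, LOG = transaction log file
       SUM((size * 8) / 1024) AS SizeMB
FROM sys.master_files
GROUP BY DB_NAME(database_id), type_desc
ORDER BY Database_Name;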



Thursday, 26 December 2013

SQL SERVER : Find out Nth Largest value

It is a very common interview question: find the 2nd or Nth largest salary of an employee.

Please find the below methods to find out the answer :

Method 1 :

SELECT TOP 1 [Column_Name] FROM [Table_Name] 
WHERE [Column_Name] NOT IN (SELECT DISTINCT TOP (n-1) [Column_Name] FROM  [Table_Name] ORDER BY [Column_Name] DESC )
ORDER BY [Column_Name] DESC

Method 2 :

SELECT * FROM (SELECT *, DENSE_RANK() OVER (ORDER BY [Column_Name] DESC) AS Rnk FROM [Table_Name]) DQ
WHERE DQ.Rnk = n

*n = for nth largest
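
Method 3 (a sketch for SQL Server 2012 and later; unlike DENSE_RANK, it does not account for ties on duplicate values):

SELECT [Column_Name] FROM [Table_Name]
ORDER BY [Column_Name] DESC
OFFSET (n - 1) ROWS FETCH NEXT 1 ROWS ONLY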

SQL SERVER : Difference Between UNIQUE KEY and PRIMARY KEY

PRIMARY Key and UNIQUE Key both enforce the uniqueness of the values in the column(s) on which they are defined.


  • By default, a UNIQUE Key creates a NONCLUSTERED INDEX while a PRIMARY Key creates a CLUSTERED INDEX
  • You can create multiple UNIQUE Keys on a table but you can have ONLY one PRIMARY Key on a table.
  • A PRIMARY Key is the usual target of a foreign key relationship, although a FOREIGN Key can also reference a UNIQUE Key.
  • A PRIMARY Key cannot have NULLs, but a UNIQUE Key allows a NULL value (only one NULL), as shown in the example below.
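
A short illustration (table and constraint names are only for demonstration):

CREATE TABLE dbo.Employee (
    EmployeeID INT NOT NULL,
    Email      VARCHAR(100) NULL,
    CONSTRAINT PK_Employee PRIMARY KEY (EmployeeID),   -- clustered index by default
    CONSTRAINT UQ_Employee_Email UNIQUE (Email)        -- nonclustered index by default, allows a single NULL
);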


Sunday, 3 November 2013

SQL SERVER : SP_WHO and SP_WHO2

There is a very useful system stored procedure called sp_who in SQL Server that lists the users and processes running on the system. It comes in handy when you want to see if any process is blocked. Both sp_who and sp_who2 provide information about the current processes / sessions / users in a SQL Server instance; sp_who2 provides some more information compared to sp_who.

SP_WHO : SP_WHO provides information like the logged-in user id, session id, host name, session status, blocking process details, database name, etc.

SP_WHO2 : SP_WHO2 provides some additional information - CPU time for each session, disk IO, last batch and program name.

As such there is little difference between the two SPs; the major one is that SP_WHO2 is not documented - it is an undocumented stored procedure.

The above two SPs can be used to monitor server connections. They can also be used to check the connections created for a specific login, a specific session, or only active sessions. You can filter the data returned to show only those processes that are not idle: if you specify a particular login, like sp_who [@loginame =] 'Login', you get the processes for that user; otherwise, processes for all users are listed. A few typical calls are shown below.
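
A quick sketch of typical calls:

EXEC sp_who;               -- all sessions
EXEC sp_who 'active';      -- exclude idle sessions
EXEC sp_who 'sa';          -- sessions for a specific login
EXEC sp_who2;              -- extended output: CPU time, Disk IO, Last Batch, Program name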




