PolyBase creates an external table in Azure Synapse Analytics (formerly Azure SQL DW) that references data stored in Azure Blob storage or Azure Data Lake, giving structure to the raw data stored there. A PolyBase external table accesses the data in parallel, bypassing the control node and bringing the data directly into the compute nodes.
Follow these steps to implement PolyBase to load data:
Create a Master Key
The master key is required to encrypt the credential secret. It is only necessary if one does not already exist in the database.
CREATE MASTER KEY;
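To check whether a master key already exists before creating one, you can query sys.symmetric_keys; a quick sketch (the database master key is listed under the fixed name ##MS_DatabaseMasterKey##):
-- Returns a row if the database master key already exists
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';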
Create a Database Scoped Credential
The credential name cannot start with the number (#) sign. System credentials start with ##.
IDENTITY - Specifies the name of the account to be used when connecting outside the server; any string is accepted. To import a file from Azure Blob storage using a shared access signature (SAS), the identity name must be SHARED ACCESS SIGNATURE.
SECRET - Specifies the secret required for outgoing authentication. It is required to import a file from Azure Blob storage; provide your Azure storage account key.
CREATE DATABASE SCOPED CREDENTIAL credential_name
WITH
IDENTITY = 'identity_name',
SECRET = '<azure_storage_account_key>'
;
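To confirm the credential was created (the secret itself is never returned), you can query sys.database_scoped_credentials:
-- Lists database scoped credentials; only metadata is exposed
SELECT name, credential_identity, create_date
FROM sys.database_scoped_credentials;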
Create an External Data Source
TYPE - PolyBase uses Hadoop APIs to access data in Azure Blob storage. Use HADOOP when the data source is Azure Blob storage, ADLS Gen1, or ADLS Gen2.
TYPE = HADOOP
LOCATION - Provide the Azure storage account name and blob container name.
LOCATION = '<prefix>://<path[:port]>'
| External Data Source | Location Prefix | Location Path |
| --- | --- | --- |
| Azure Blob Storage | wasb[s] | <container>@<storage_account>.blob.core.windows.net |
| Azure Data Lake Store Gen 1 | adl | <storage_account>.azuredatalakestore.net |
| Azure Data Lake Store Gen 2 | abfs[s] | <container>@<storage_account>.dfs.core.windows.net |
CREDENTIAL - Provide the credential created in the previous step. A credential is only required if the storage has been secured; CREDENTIAL is not required for data sets that allow anonymous access.
CREATE EXTERNAL DATA SOURCE datasource_name
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net',
CREDENTIAL = credential_name
);
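For ADLS Gen2 the same statement uses the abfs[s] prefix and dfs endpoint from the table above; a minimal sketch, assuming a hypothetical data source name AdlsGen2Source:
-- Hypothetical ADLS Gen2 variant of the data source
CREATE EXTERNAL DATA SOURCE AdlsGen2Source
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://<container>@<storage_account>.dfs.core.windows.net',
CREDENTIAL = credential_name
);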
Create an External File Format
FORMAT_TYPE - The type of the files in Azure storage; supported file formats are DELIMITEDTEXT, RCFILE, ORC, and PARQUET.
FORMAT_OPTIONS - Specify the field terminator, string delimiter, date format, etc. for delimited text files.
DATA_COMPRESSION - Specify the DATA_COMPRESSION method if the data is compressed; PolyBase can process compressed files.
CREATE EXTERNAL FILE FORMAT fileformat_name
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
FIRST_ROW = 2,
DATE_FORMAT = 'dd/MM/yyyy',
USE_TYPE_DEFAULT = FALSE
),
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);
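Columnar formats need fewer options; a minimal sketch for Parquet with Snappy compression, assuming a hypothetical format name parquet_ff:
-- Hypothetical Parquet file format; Snappy is the usual codec for Parquet
CREATE EXTERNAL FILE FORMAT parquet_ff
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);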
Create the External Table
Specify column names and data types; these need to match the data in the source file.
LOCATION - Specify the path to the file or directory that contains the data, relative to the blob container. To point to all files under the blob container, use LOCATION = '.'
REJECT options - These control how PolyBase handles dirty records it retrieves from the external data source. A data record is considered 'dirty' if its actual data types or number of columns do not match the column definitions of the external table.
CREATE EXTERNAL TABLE DBO.EXT_TABLE (
Column_name Data_type
,Column_name1 Data_type
,Column_name2 Data_type
)
WITH (
LOCATION= 'file_name.csv',
DATA_SOURCE = datasource_name,
FILE_FORMAT = fileformat_name,
REJECT_TYPE = PERCENTAGE,
REJECT_VALUE = 10,
REJECT_SAMPLE_VALUE = 1000,
REJECTED_ROW_LOCATION = '/REJECT_Directory'
);
Now, a SELECT statement will read data from blob storage through the external objects we have created:
SELECT column_name,column_name1 ....
FROM dbo.ext_table
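To actually load the data into the dedicated SQL pool rather than just query it in place, the usual pattern is CREATE TABLE AS SELECT (CTAS); a minimal sketch, assuming a hypothetical target table dbo.stage_table:
-- Hypothetical CTAS load: materializes the external data into a distributed table
CREATE TABLE dbo.stage_table
WITH (
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM dbo.ext_table;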
Limitations:
- Only these DDL (Data Definition Language) statements are allowed on external tables:
- CREATE TABLE and DROP TABLE
- CREATE STATISTICS and DROP STATISTICS (see the sketch after this list)
- CREATE VIEW and DROP VIEW
- DML (Data Manipulation Language) operations are not supported (i.e. DELETE, INSERT, and UPDATE)
- Query limitations - PolyBase can consume a maximum of 33,000 files per folder when running 32 concurrent PolyBase queries; this count includes both files and sub-folders in each HDFS folder. If the degree of concurrency is less than 32, a user can run PolyBase queries against folders in HDFS that contain more than 33,000 files.
- Table width limitations - PolyBase in Azure Synapse has a row width limit of 1 MB, based on the maximum size of a single valid row by table definition. If the sum of the column schema is greater than 1 MB, PolyBase cannot query the data from storage.
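For example, statistics on an external table are created with the same syntax as on a regular table; a minimal sketch on the table defined above:
-- Single-column statistics on the external table; one of the few DDL operations allowed
CREATE STATISTICS stat_ext_column ON dbo.ext_table (Column_name);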