The Importance of New Case-Insensitive Collation i...

AndyDDC · ‎11-18-2024

When Fabric was announced and made available to everyone in the summer of 2023 it represented a seismic shift in simplifying how end-to-end data and analytics solutions can be architected and developed. Key to this is the Fabric Warehouse service, a T-SQL based "software-as-a-service" platform enabling SQL developers the world over to bring their skills to a dedicated Data Warehousing platform. The strength of the Warehouse platform is its auto-scale and fault-tolerance capabilities, whilst the storage layer uses the Parquet file format and Delta Lake - open-source technologies which are opening up new possibilities in the world of Data Warehousing.

Since Fabric has been made available including making General Availability (GA), regular feature updates have rolled out across multiple services including the Warehouse. One such feature we'll be diving into in this blog post is a new case insensitive collation Latin1_General_100_CI_AS_KS_WS_SC_UTF8. There are countless collations in RDBMS products such as SQL Server, why is it an important feature release? Well, we have to look at why only 1 collation Latin1_General_100_BIN2_UTF8 was supported up until now? The reason includes the optimisation of the Parquet file format used in the Fabric Warehouse. Latin1_General_100_BIN2_UTF8 collation is compatible with parquet string sorting rules and allows the Warehouse engine to optimise filtering. However, there is a downside and that is the default collation is case sensitive.

Why do we care about this? Let's go through an example when case sensitivity may be undesirable in object names and in the data itself. In the walkthrough below, a Fabric Warehouse has already been created with the default collation of Latin1_General_100_BIN2_UTF8. We will create new objects and insert data for testing.

Object Names

In the Warehouse, run the following T-SQL script to create 2 new tables. If you notice, the 2 new tables have the same name but have different casings.

--create first table
CREATE TABLE dbo.DimCustomer
(
    CustomerKey INT,
    CustomerName VARCHAR(100)
);

--create second table
CREATE TABLE dbo.dimcustomer
(
CustomerKey INT,
CustomerName VARCHAR(100)
);

An error will appear when running the second table about using different capitalizations. However, if you query INFORMATION_SCHEMA.TABLES both tables have been created. The issue now is that the second table will not show in the list of tables in the Warehouse itself. This may be confusing to any end users or other developers accessing the Warehouse.

Matching Data

Let's now look at an example when using data with different casings. When we JOIN using the ProductName column, because there are different casings in the telemetry table, they are not matched and not returned in the aggregate query.

CREATE TABLE dbo.Raw_Products
(
    ProductID INT,
    ProductName VARCHAR(255),
    ProductModel VARCHAR(50)
);
GO

INSERT INTO dbo.Raw_Products
VALUES
    (1,'Product One','Model B'),
    (2,'Product Two','Model B');

--create web telemetry data
CREATE TABLE dbo.Raw_ProductWebTelemetry
(
    ProductName VARCHAR(255),
    PageViews INT,
    Purchases INT
);
GO

INSERT INTO dbo.Raw_ProductWebTelemetry
VALUES
    ('Product One',10,2),
    ('product One',14,5),
    ('product one',16,3);

--select telemetry and group by model
SELECT
    DP.ProductModel,
    SUM(MD.PageViews) AS TotalPageViews,
    SUM(MD.Purchases) AS TotalPurchases
FROM dbo.Raw_ProductWebTelemetry MD
INNER JOIN dbo.Raw_Products DP On DP.ProductName = MD.ProductName
GROUP BY DP.ProductModel;

Creating a Warehouse with Case-Sensitive Collation

Currently we can create a new Warehouse and specify the collation (more fine-grained methods of working with collations are coming) using the REST API. In this example, I will use Semantic Link in a Notebook to create a new Warehouse using the case-insensitive (CI) collation - Latin1_General_100_CI_AS_KS_WS_SC_UTF8.

In a Notebook, run the following code to create the Warehouse.

#sempy version 0.4.0 or higher
!pip install semantic-link --q 
import json
import sempy.fabric as fabric

--get the current workspace id
workspace_id=spark.conf.get("trident.workspace.id")

#Instantiate the client
client = fabric.FabricRestClient()

#create the payload
uri = f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items"
payload = { 
  "type": "Warehouse", 
  "displayName": "WarehouseCaseInsensitive", 
  "description": "Case Insensitive Warehouse", 
  "creationPayload": { 
    "defaultCollation": "Latin1_General_100_CI_AS_KS_WS_SC_UTF8" 
  } 
}

# Call the REST API
response = client.post(uri,json= payload)
display(response)

We can check to ensure the Warehouse is created using case insensitivity by querying sys.databases.

SELECT name, collation_name FROM sys.databases;

Now if we run the "DimCustomer" creation script, the first table is created successfully but the second script will fail with "object already exists" and the second table is not created. We now have unique object names.

If we run the data JOIN example, we can see that all the product data is successfully matched despite the product name casing differences.

References

The Importance of New Case-Insensitive Collation in Fabric Warehouse

An Overview of Lakehouses and Data Warehouses

Re: An Overview of Lakehouses and Data Warehouses

Re: An Overview of Lakehouses and Data Warehouses

Fabric Connectivity Series Part #1 - Inbound SQL D...