What is our primary use case?
Merge is primarily used for change data capture to identify inserts, updates, and deletes for transactions in data warehousing. This is the place where I implement Merge, which is the most efficient solution through SQL rather than using stored procedures to identify inserts and updates separately.
In one of my data warehousing projects, when we load data, it is in medallion layer form with different layers including a raw layer, staging layer, and data mart modeling layer. We generally call these the bronze, silver, and gold layers. When it comes into the gold layer, that is where we generally implement change data capture. The raw layer contains all raw data as we receive it from the source. We then bring the incremental data into the silver layer with basic data cleansing activities. In the gold layer, we perform more transformations and keep history and report-related data. When we make changes in the gold layer, this is where change data capture comes into the picture for identifying changes. When we have a huge table, we do not want to reload everything on a daily basis but instead want to identify what changes have occurred and load only that data.
We can accomplish this in multiple ways. One approach is using an ETL tool to identify the source and target, compare both, and figure out what the incremental changes are, including inserts and updates. Another approach is using a stored procedure, but in that case we do not know how many rows were updated or inserted. The third approach is using SQL statements, which is where Merge comes into the picture. This is one of the most efficient ways of doing things. When working with Snowflake, instead of ETL, ELT became a more optimized way of doing things. During this process with Snowflake, I have tried multiple times using Merge. With Merge, I am able to figure out what the inserts and updates are, capture those records, and load them into the target.
The main use case is finding data changes. There is nothing specific about using a particular project to only use Merge, but for most projects where we do CDC changes, instead of using an ETL tool, we can use native SQL and implement this solution.
What is most valuable?
We can perform inserts and updates, and at the same time when we do these changes, we can compare using hash joins. That is the main advantage of using Merge.
When I use hash joins, I create a hash key. Whenever I want to do a Merge statement with a table having 10 key combinations and 40 non-keys, I do not want to handle all that complexity. I create hash keys for the key combinations and create another hash key for the non-keys. When I join between the source and target table, I can use a hash key join to figure out what are inserts and what are updates.
One significant advantage is the ability to go with open source and not depend on proprietary ETL tools. Tools may be helpful, but they create dependency without much knowledge sharing. If we use open source solutions such as SQL, that will give more exposure and is a common technology skillset. I do not need to look for a skilled person who knows only specific ETL features. If someone has knowledge of data warehousing concepts, I can hire them with basic SQL skills and they can perform the work in SQL itself. This way, I do not need to limit my hiring of resources to a particular skill set. Open source solutions can have a broader category where I can bring in people. Additionally, tool-specific solutions are licensed products that charge millions for an enterprise, so this is also a good cost saving advantage.
What needs improvement?
Regarding optimization, I have observed that on the Snowflake side, with Merge being used on a table that has micro partitions and has been well maintained, the statement execution will be very fast. This helps in both cost savings and performance-wise optimization.
It would be beneficial if there is a common public community so that people who may not be aware of this will become more aware. Additionally, when we do joins within the Merge statement, there is always a chance of getting duplicates. Snowflake added a feature called QUALIFY that allows us to filter based on ranking and eliminate duplicate records. If such a feature were available in regular Merge statements within any database, that would be helpful. This would eliminate the need to rebuild another subquery inside the Merge statement.
For how long have I used the solution?
I have been working in the current field for almost 13 plus years.
What other advice do I have?
In one of my data warehousing projects, when we load data, it is in medallion layer form with different layers including a raw layer, staging layer, and data mart modeling layer. We generally call these the bronze, silver, and gold layers. When it comes into the gold layer, that is where we generally implement change data capture. The raw layer contains all raw data as we receive it from the source. We then bring the incremental data into the silver layer with basic data cleansing activities. In the gold layer, we perform more transformations and keep history and report-related data. When we make changes in the gold layer, this is where change data capture comes into the picture for identifying changes. When we have a huge table, we do not want to reload everything on a daily basis but instead want to identify what changes have occurred and load only that data.
We can accomplish this in multiple ways. One approach is using an ETL tool to identify the source and target, compare both, and figure out what the incremental changes are, including inserts and updates. Another approach is using a stored procedure, but in that case we do not know how many rows were updated or inserted. The third approach is using SQL statements, which is where Merge comes into the picture. This is one of the most efficient ways of doing things. When working with Snowflake, instead of ETL, ELT became a more optimized way of doing things. During this process with Snowflake, I have tried multiple times using Merge. With Merge, I am able to figure out what the inserts and updates are, capture those records, and load them into the target.
With open source solutions, we get pay-as-you-go pricing as we use the service. Additionally, when we have a major issue, we pay for the service we may need to take. This results in very different cost perspectives. We do not need to pay just for the license cost. On the cost optimization side, and from a hiring perspective, if I want a data engineer, I do not need to look for an ETL-specific pipeline developer role. In the current world with hundreds of ETLs available, I do not need to look for a specific ETL person who knows the features of that tool to develop as an expert from day one. I would either need to take an ETL developer and give them leverage of days or weeks to make them feel expertise in this particular ETL tool before starting development. Instead, if I go with an open source solution such as Snowflake or Databricks where I can use SQL or regular Python coding, I can have a very wide variety of people to hire. Anyone who comes with a basic SQL background can directly jump into the work. I do not need to look for a specific tool skill that will take longer to hire. This way I can save time in hiring people. I would rate this solution an 8 out of 10.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?