Creating a Table in SQL: Quick and Easy vs. Full Setup
Creating a table in SQL is one of those things every good data engineer needs to know how to do.
Like most things, there’s an easy way and a good way to do it.
So, without further ado, here are the two ways you can create a table in SQL…
Aside: I’ll be using the Cat Fact Breed API endpoint (not because I am a cat person but because it is a free and simple API to use). I’ll also be doing it in Azure SQL, which, as a relational DB, is blasphemy within some data circles. However, the SQL syntax will be mostly the same if you do it in Snowflake or MySQL — end of aside.
The Quick and Easy Way
The easy way to do this is simply to take the columns you want mapped and use a “Create Table” statement like so:
CREATE TABLE Cat_Facts_Breeds (
    Breed nvarchar(255),
    country nvarchar(255),
    origin nvarchar(255),
    coat nvarchar(255),
    pattern nvarchar(255)
);
Hit run, and voilà, you have a table in SQL!
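To sanity-check the new table, you can insert a row and read it back. The breed values below are just illustrative samples, not live data from the API:

```sql
-- Insert a sample row (illustrative values, not pulled from the API)
INSERT INTO Cat_Facts_Breeds (Breed, country, origin, coat, pattern)
VALUES ('Abyssinian', 'Ethiopia', 'Natural/Standard', 'Short', 'Ticked');

-- Read it back
SELECT Breed, country, origin, coat, pattern
FROM Cat_Facts_Breeds;
```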
The Full Setup
This table is missing a couple of things, though. If you’ve been around the block, you probably saw it immediately.
This table is missing a primary key!
Now, we could use Breed as a primary key. After all, it is the Breed endpoint, so it’s a good guess that this is, in fact, the primary key.
However, the problem with making Breed the primary key is that strings typically make poor index keys: they are wide and expensive to compare. Natural keys can also be volatile; a change in the source system can disrupt what is happening in the data warehouse.
Instead, it’s better to create an integer primary key that auto-increments, like this:
CREATE TABLE Cat_Facts_Breeds (
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Breed nvarchar(255),
    country nvarchar(255),
    origin nvarchar(255),
    coat nvarchar(255),
    pattern nvarchar(255)
);
Now, this is pretty good compared to our previous implementation, but it’s still missing something.
If you’re debugging a pipeline, you likely want to know when data was last loaded into the table in question. You could look at the ETL pipeline run, but that doesn’t always tell you whether the data actually made it to the table. This is why I always add an ETL update timestamp to my tables. You could call it an update date or a create date; either way, you want a column that tells you when each row was loaded and, for that matter, updated.
CREATE TABLE Cat_Facts_Breeds (
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Breed nvarchar(255),
    country nvarchar(255),
    origin nvarchar(255),
    coat nvarchar(255),
    pattern nvarchar(255),
    etl_upd_ts datetime DEFAULT GETDATE()
);
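One thing to keep in mind: a column default only fires on INSERT. If your pipeline also runs UPDATEs, the timestamp won’t refresh on its own. In SQL Server / Azure SQL you can handle that with a small trigger; here’s a sketch (the trigger name is my own choice, adjust to your conventions):

```sql
-- Sketch: refresh etl_upd_ts whenever a row is updated.
-- Assumes the Cat_Facts_Breeds table defined above (Azure SQL / SQL Server syntax).
CREATE TRIGGER trg_Cat_Facts_Breeds_upd
ON Cat_Facts_Breeds
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Join to the "inserted" pseudo-table to stamp only the rows that changed
    UPDATE t
    SET etl_upd_ts = GETDATE()
    FROM Cat_Facts_Breeds t
    INNER JOIN inserted i ON t.ID = i.ID;
END;
```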
Our table is starting to come together!
But what if your process bugged out and duplicate breeds were inserted in the table? This would be a violation of the business logic. It doesn’t make sense to have two breeds of the same name.
All of a sudden, your data is wrong, and this table is, in effect, useless.
To prevent this, we’ll add a unique constraint to the table that ensures no duplicates in the Breed column.
CREATE TABLE Cat_Facts_Breeds (
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Breed nvarchar(255),
    country nvarchar(255),
    origin nvarchar(255),
    coat nvarchar(255),
    pattern nvarchar(255),
    etl_upd_ts datetime DEFAULT GETDATE()
);
ALTER TABLE Cat_Facts_Breeds
ADD CONSTRAINT UQ_Breed UNIQUE (Breed);
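With the constraint in place, a second insert of the same breed is rejected instead of silently duplicating data. Something like this (again, sample values) would fail on the second statement with a unique-constraint violation:

```sql
INSERT INTO Cat_Facts_Breeds (Breed, country, origin, coat, pattern)
VALUES ('Bengal', 'United States', 'Hybrid', 'Short', 'Spotted');

-- This second insert violates UQ_Breed and raises an error:
INSERT INTO Cat_Facts_Breeds (Breed, country, origin, coat, pattern)
VALUES ('Bengal', 'United States', 'Hybrid', 'Short', 'Spotted');
```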
Now that is a good-lookin’ table!
So here are the takeaways when creating tables in SQL:
- Do have surrogate keys
- Do have update timestamps
- Do have unique constraints (alternate keys) on natural keys
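Putting those takeaways to work: because Breed is enforced as an alternate key, your load step can upsert against it and let the surrogate key and timestamp take care of themselves. Here’s a sketch in T-SQL; the staging table name is an assumption, swap in wherever you land the API response:

```sql
-- Sketch: upsert from a hypothetical staging table, matching on the natural key.
MERGE Cat_Facts_Breeds AS tgt
USING Staging_Cat_Facts_Breeds AS src
    ON tgt.Breed = src.Breed
WHEN MATCHED THEN
    UPDATE SET country    = src.country,
               origin     = src.origin,
               coat       = src.coat,
               pattern    = src.pattern,
               etl_upd_ts = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (Breed, country, origin, coat, pattern)
    VALUES (src.Breed, src.country, src.origin, src.coat, src.pattern);
```

The surrogate ID is generated by IDENTITY on insert, and etl_upd_ts is set by its default on insert and refreshed explicitly on update.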