Primary Key

A primary key is defined to identify in a unique way, a set of columns in a table

A primary key has the following specifications:

NOT NULL
UNIQUE

Foreign Key

A foreign key is made with one or several corresponding columns of a primary Key:

To ensure junction between tables and find the corresponding information in a child table (I can reference someone in my Primary table by affecting a number, and I can find a corresponding information in a secondary table by reading all the rows with the same number)
To ensure the referential integrity (I can not delete a record in my primary table if corresponding information still exists in my secondary table)

Integrity Constraint – Natural keys vs surrogate keys

Natural keys
Natural keys are made with columns involved for the need of users, within you insert / modify data.

Finding a natural key means you choose one or several columns : a social ID is by definition unique and not null and is a perfect candidate for being a primary natural key. This particular case works when you are managing people in a table linked to other dependent tables.

Unfortunately, you sometimes manage other informations and finding a natural key, can easily become a nightmare : you often need to compose you key of different types of information, if you respect the ACID rules when you create your database (Atomicity, Consistency, Integrity, Durability)

Surrogate keys
A surrogate key is a key you just made up without any connection with the real data in the database(generally an integer or bigint). By definition, this key is only used to link tables together, but never used directly by the user.

This key are generated most often on one column wether code side (I wouldn’t advise this technic, I will explain why later in this article), or SQL Server side (that is the best choice)

How to choose a type of data for your keys

As we saw previously, the best way to manage a primary key / foreign key is to choose between auto incremented INT or BIGINT.Everything depends on the amount of rows you will have to manage within a single table and the environment within you are running your database.
In a 32bit environment
Do you remember what is a 32bit environment ? It means your system runs with a processor that is able to treat 32bit of data in one clock cycle. If you remember well a computer is cadenced by a clock (value given by the vendors ie 2.8Ghz) that gives the number of time the processor can use its components (buses, piles, accumulator,…) in one second. 32bit represent 4 bytes (1 byte is 8 bit)

An INT is calculated on 4bytes and so can store 2p32 that means can generated an unsigned number between 0 to 4,294,967,295 (which equals to 2³² − 1) that means 4 billions of different numbers! more than enough to manage the whole employees of a company, the salaries, the invoices, even phone numbers of whole people on earth…

In a 32bit environment, an INT perfectly fits with the size of the processor (in one clock cycle, a 4 bytes processor or 32 bit processor can generate / use a 4bytes INT or 32bit INT
In a 64bit environment
Same application for 64 environment : 64bit = 8 bytes

A BIGINT is composed of 8 bytes and can adress an unsigned number from 0 to 18,446,744,073,709,551,615 which equals to 2⁶⁴ − 1. You can generate a BIGINT in one clock cycle and manage it in the same way. This size perfectly fits with the 64bit system
Uniqueidentifier
It’s trendy now to use a lot of uniqueidentifiers in databases. It’s a kind of ‘magic wand’ that allows you to manage a huge amount of data and identify each one in a unique way. The uniqueidentifier can warranty its uniqueness by basing the generated ID on the network card mac address and random operations on the CPU Clock

SELECT NEWID() AS UNIQUEIDENTIFIER

UNIQUEIDENTIFIER
086666BB-FE3E-4CDC-9827-2A5EE3949BF6

An uniqueidentifier is made of 16 bytes. As you can guess, following the previous rules, a uniqueidentifier will need two clock cycles on a 64 bit (or 4 clock cycles on a 32bit environment) to be generated. obviously, it’s not an optimized type for primary keys/foreign keys for a table in order to identify common data.

The only exception you can imagine is linking two different databases that really need uniqueness. The SQL Server Merge Replication uses this data type to link :

the table you have created and
the merge tables installed when you publish some articles on your database.

The merge replication by default do not use your primary keys but add a new column called ROWGUID with a uniqueidentifier type. The triggers are fired each time you insert a data within an implicit or explicit transaction in your table and use this ROWGUID Columns to link with the ID of the merge table.

Comparison between a few type of surrogate keys and natural keys

We will create here some table in a database in order to fulfill then with data and analyse different criterias like:

Number of reads
Number of writes
page count
Percentage of average space used in pages
Percentage of fragmentation in pages

 -- INT IDENTITY 
CREATE TABLE [dbo].[TESTTABLE_INT](
	[ID] [int] IDENTITY(1,1) NOT NULL,
	[FirstName] [char](10) NOT NULL,
	[LastName] [char](20) NOT NULL,
	[Job] [char](15) NULL,
PRIMARY KEY CLUSTERED 
(
	[ID] ASC
))

-- BIGINT IDENTITY 
CREATE TABLE [dbo].[TESTTABLE_BIGINT](
	[ID] [bigint] IDENTITY(1,1) NOT NULL,
	[FirstName] [char](10) NOT NULL,
	[LastName] [char](20) NOT NULL,
	[Job] [char](15) NULL,
PRIMARY KEY CLUSTERED 
(
	[ID] ASC
))

-- UNIQUEIDENTIFIER NEWID 
CREATE TABLE [dbo].[TESTTABLE_NEWID](
      [ID] [uniqueidentifier] DEFAULT (newid()) NOT NULL,
      [FirstName] [char](10) NOT NULL,
      [LastName] [char](20) NOT NULL,
      [Job] [char](15) NULL,
)

-- UNIQUEIDENTIFIER NEWSEQUENTIALID 
CREATE TABLE [dbo].[TESTTABLE_NEWSEQUENTIALID](
      [ID] [uniqueidentifier] DEFAULT (newsequentialid()) NOT NULL PRIMARY KEY CLUSTERED,
      [FirstName] [char](10) NOT NULL,
      [LastName] [char](20) NOT NULL,
      [Job] [char](15) NULL,
)

CREATE TABLE [dbo].[TESTTABLE_NATURALKEYS](
	[FirstName] [char](10) NOT NULL,
	[LastName] [char](20) NOT NULL,
	[Job] [char](15) NOT NULL,
PRIMARY KEY CLUSTERED 
(
	[FirstName] ASC,
	[LastName] ASC,
	[Job] ASC
))

Now we generate data in the differents table with the following script:

PROCEDURE [dbo].[sp_GenerateRandomString]
(
@intGENERATEDSIZE       INTEGER
,@strGeneratedString    VARCHAR(30) OUTPUT
)
AS
DECLARE @REFERENCE                 AS VARCHAR(63)
DECLARE @intREFERENCESIZE          AS INTEGER
DECLARE @strWORD                   AS VARCHAR(30)
DECLARE @Letter                    CHAR(1)

SET @REFERENCE = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz1234567890'
SET @intREFERENCESIZE = len(@REFERENCE);
SET @strWORD =''

WHILE @intGENERATEDSIZE > 0
     BEGIN

          SET @Letter = SUBSTRING(@REFERENCE, CONVERT(INT,(RAND() * @intREFERENCESIZE))+1, 1);
          SET @strWORD = @strWORD + @Letter
          SET @intGENERATEDSIZE = @intGENERATEDSIZE -1;

     END;

SET @strGeneratedString = @strWORD

Now we can insert 20 000 rows in each table

DECLARE @COUNT INT;

SET @COUNT = 0;
DECLARE @strFirstName   VARCHAR(10)
DECLARE @strLastName    VARCHAR(20)
DECLARE @strJob         VARCHAR(15)

WHILE @COUNT < 20000

     BEGIN
          exec sp_GenerateRandomString 20 , @strFirstName OUTPUT
          exec sp_GenerateRandomString 20 , @strLastName OUTPUT
          exec sp_GenerateRandomString 20 , @strJob OUTPUT

          INSERT dbo.TESTTABLE_NATURALKEYS
                   (
		FirstName
		,LastName
		,Job
		) 
		VALUES
		(
		@strFirstName
		,@strLastName
		,@strJob
		)
	SET @COUNT = @COUNT + 1; 

     END;

Here are the graphical results: