Search

Analytical (Window) Functions in SQL - Navigation Functions

Navigational functions are a category of window functions which are used to access values from rows that are different from the current row in an SQL table. There are many navigational functions such as LEAD, LAG, FIRST_VALUE, LAST_VALUE, NTH_VALUE etc. To better understand these functions, let's work with the following dataset:

Solar Panel Price (1 Watt) 1976-86

The above table contains the prices of solar panels (in dollars) over the years 1976 - 1986. It is quite apparent from the table that the prices went down heavily as the years passed making it extremely cheaper to harness solar power. Now, if we wished to compare previous year's price with the current price in each row, we can use LAG() function to access previous row's value and place it beside the current year's price in each row.

SELECT Year, Price_Per_Watt as Price, 
LAG(Price_Per_Watt) OVER(ORDER BY Year) AS Previous_Year_Price
FROM `solar_panel_price_per_watt` 

In each row, we have previous row's 'Price' value as data for the new column Previous_Year_Price. The new column's value for the year 1976 has been left as NULL because 1976 is the first year and there's no row before that. Let us now calculate the percentage by which prices have fallen each year and arrange the table according to the years with highest drop in price, we can do so using the following query:

SELECT Year, Price_Per_Watt as Price, LAG(Price_Per_Watt) OVER(ORDER BY Year) AS Previous_Year_Price,
CONCAT(ROUND(((Price_Per_Watt - LAG(Price_Per_Watt) OVER(ORDER BY Year))/LAG(Price_Per_Watt) OVER(ORDER BY Year))*100, 2), '%') AS Price_Reduction_Percentage
FROM `solar_panel_price_per_watt`
ORDER BY Price_Reduction_Percentage DESC

Looking at the results, 1978 comes out on top as the year with the highest drop in prices followed by the year 1980. Positional functions are thus, extremely useful in comparing values in the current row with the values in rows below or above. While LAG() is used to access rows from above, LEAD() does the exact opposite - obtains values from rows below the current row.

SELECT Year, Price_Per_Watt as Price, 
LEAD(Price_Per_Watt) OVER(ORDER BY Year) AS Next_Year_Price
FROM `solar_panel_price_per_watt`

Here in each row, we have next row's 'Price' value as data for the new column Next_Year_Price. The new column's value for the year 1986 has been left as NULL because 1986 is the last year and there's no row after that.


Consider the following dataset:

No of Solar Panels Sold across 3 stores

The above dataset displays the sales information of 3 stores selling solar panels over a given year. Now if we wished to compare each store's sales with its previous quarter's sales, we can do so using the following query:

SELECT *, LAG(Units_Sold , 1) OVER(PARTITION BY Store_ID ORDER BY Quarter) AS Units_Sold_Previous_Quarter
FROM `solar_panel_sales`

If we observe the above query, we have a second attribute in the LAG function - LAG(Units_Sold , 1). The '1' here is called the offset value and it is by default set to 1. If we change it to '2', the LAG() function fetches the value from 2 rows before rather than from the previous row. Since we partitioned the table by store_id, each store's 1st quarter is empty in the Units_Sold_Previous_Quarter column because that is the first column and there are no records prior to that. Let us now assume that all these stores opened in that particular year and hence, there no sales before that. So it makes sense replacing those empty cells in the 1st quarter's Units_Sold_Previous_Quarter column with a '0' as there are no actual sales prior to that quarter.. We can do so by adding a 3rd attribute to the LAG function - LAG(Units_Sold , 1, 0).

SELECT *, LAG(Units_Sold , 1, 0) OVER(PARTITION BY Store_ID ORDER BY Quarter) AS Units_Sold_Previous_Quarter
FROM `solar_panel_sales`

However, it is important to note that only those cells that are rendered 'null' due to no preceding rows are subsequently replaced with a '0'. If there exists a 'null' already in the actual values before even applying the function, it is bound to stay as 'null'.


Let us now explore the relatively simpler functions of FIRST_VALUE, LAST_VALUE and NTH_VALUE. The function FIRST_VALUE() is used to access values from the first row in a window frame while the function LAST_VALUE() is used to access values from the last row. Similarly, NTH_VALUE() is used to access values from the nth row provided the 'n' value. These functions return NULL by default if the first value is 'null'. However, if we include 'IGNORE NULLS' in the window function, they retrieve the next non null value. Let us apply these functions on our 'solar_panel_sales' dataset.

SELECT *, FIRST_VALUE(Units_Sold) OVER(PARTITION BY Store_ID ORDER BY Quarter) AS Sales_Q1,
NTH_VALUE(Units_Sold, 2) OVER(PARTITION BY Store_ID ORDER BY Quarter ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Sales_Q2,
NTH_VALUE(Units_Sold, 3) OVER(PARTITION BY Store_ID ORDER BY Quarter ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Sales_Q3,
LAST_VALUE(Units_Sold) OVER(PARTITION BY Store_ID ORDER BY Quarter ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Sales_Q4
FROM `solar_panel_sales` 
ORDER BY 1,2

A closer look at the query reveals a peculiar expression 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' being used for functions NTH_VALUE and LAST_VALUE. To understand why, we must first explore how a window frame is configured in window functions. When we use PARTITION BY Store_ID, the dataset is bifurcated into 3 parts as there are 3 different stores. By default, a window frame starts with the first row of the partition and ends with the current row. So when SQL is going over the first row in each partition, it understands that the frame starts with the first row but also ends with the first row. Whenever it is going over the second row however, it assumes, frame ends with second row. The current row is always assumed to be the end of the frame unless explicitly mentioned otherwise. The argument 'ROWS BETWEEN' updates SQL about the start and end of the frame manually. The parameters 'UNBOUNDED PRECEEDING AND UNBOUNDED FOLLOWING' sets the start of the frame to be the first row of partition and end of the frame to be the last row of partition.