What is Data Profiling?
Data profiling refers to the process of examining the data available in an existing data source (e.g., an Excel sheet) and summarizing its characteristics. The goal is to understand the structure, quality, and content of the data. Common tasks in data profiling include checking for nulls, duplicates, data distribution, data types, and identifying unusual values or outliers.
Objective
We will develop a customized data profiling tool in Excel using VBA. The tool will:
- Analyze data in an Excel worksheet.
- Identify common data profiling metrics such as:
- Missing or blank values
- Duplicates
- Data type mismatches
- Basic statistics (e.g., Min, Max, Average, Count)
- Present the results in a new worksheet for easy review.
VBA Code for Data Profiling Tool
Step 1: Create a New VBA Module
To start, press Alt + F11 to open the Visual Basic for Applications (VBA) editor. Then go to Insert > Module to create a new module.
Step 2: Write the Data Profiling Code
Paste the following VBA code into the module.
Sub DataProfiling()
Dim ws As Worksheet
Dim profilingWs As Worksheet
Dim rng As Range
Dim rowCount As Lon
Dim colCount As Long
Dim colIndex As Integer
Dim cell As Range
Dim dataDict As Object
Dim value As Variant
Dim emptyCount As Long
Dim duplicateCount As Long
Dim minVal As Variant
Dim maxVal As Variant
Dim avgVal As Double
Dim countVal As Long
Dim errorMsg As String
' Initialize dictionary to track column data
Set dataDict = CreateObject("Scripting.Dictionary")
' Prompt user to select the target sheet for profiling
On Error Resume Next
Set ws = Application.InputBox("Select a Worksheet to Profile:", Type:=8)
On Error GoTo 0
' Check if worksheet is selected
If ws Is Nothing Then
MsgBox "No sheet selected. Exiting profiling tool.", vbExclamation
Exit Sub
End If
' Create a new worksheet for profiling results
Set profilingWs = ThisWorkbook.Sheets.Add
profilingWs.Name = "Data Profiling"
' Write header in profiling sheet
profilingWs.Cells(1, 1).Value = "Column Name"
profilingWs.Cells(1, 2).Value = "Missing Values"
profilingWs.Cells(1, 3).Value = "Duplicate Values"
profilingWs.Cells(1, 4).Value = "Min Value"
profilingWs.Cells(1, 5).Value = "Max Value"
profilingWs.Cells(1, 6).Value = "Average Value"
profilingWs.Cells(1, 7).Value = "Value Count"
' Get the range of the data (non-empty cells)
Set rng = ws.UsedRange
rowCount = rng.Rows.Count
colCount = rng.Columns.Count
' Loop through each column to collect profiling data
For colIndex = 1 To colCount
emptyCount = 0
duplicateCount = 0
minVal = ""
maxVal = ""
avgVal = 0
countVal = 0
' Initialize dictionary to track duplicates
Set dataDict = CreateObject("Scripting.Dictionary")
' Loop through each row in the column
For Each cell In rng.Columns(colIndex).Cells
' Check for missing (empty) values
If IsEmpty(cell.Value) Then
emptyCount = emptyCount + 1
Else
' Track values for duplicates
If dataDict.Exists(cell.Value) Then
duplicateCount = duplicateCount + 1
Else
dataDict.Add cell.Value, Nothing
End If
' Track min, max, and average for numerical data
If IsNumeric(cell.Value) Then
If minVal = "" Or cell.Value < minVal Then minVal = cell.Value
If maxVal = "" Or cell.Value > maxVal Then maxVal = cell.Value
avgVal = avgVal + cell.Value
countVal = countVal + 1
End If
End If
Next cell
' Calculate average if there are numeric values
If countVal > 0 Then avgVal = avgVal / countVal
' Output the profiling data into the profiling sheet
profilingWs.Cells(colIndex + 1, 1).Value = rng.Cells(1, colIndex).Value
profilingWs.Cells(colIndex + 1, 2).Value = emptyCount
profilingWs.Cells(colIndex + 1, 3).Value = duplicateCount
profilingWs.Cells(colIndex + 1, 4).Value = minVal
profilingWs.Cells(colIndex + 1, 5).Value = maxVal
profilingWs.Cells(colIndex + 1, 6).Value = avgVal
profilingWs.Cells(colIndex + 1, 7).Value = countVal
Next colIndex
' Auto-fit columns for better readability
profilingWs.Columns("A:G").AutoFit
' Notify user that profiling is complete
MsgBox "Data profiling complete! Check the 'Data Profiling' worksheet for results.", vbInformation
End Sub
Explanation of the Code
- Creating the Profiling Sheet
- The code begins by prompting the user to select a worksheet to profile using an input box. If no worksheet is selected, it exits the macro.
- A new worksheet, « Data Profiling, » is created to store the results of the profiling.
- Profiling Each Column
- The code loops through each column of the selected worksheet and checks for:
- Missing values (Empty cells): This is done using IsEmpty(cell.Value).
- Duplicate values: A dictionary (dataDict) is used to track the values that have already been encountered.
- Numerical analysis (only for numeric data): It calculates the minimum, maximum, and average values, as well as the count of numeric values.
- Writing the Profiling Results
- The results for each column are written in the new « Data Profiling » sheet. The profiling data includes:
- Column name
- Missing values
- Duplicate values
- Minimum value
- Maximum value
- Average value
- Value count
- Formatting and Finalization
- Once the profiling is done, the columns in the « Data Profiling » sheet are auto-fitted for readability.
- Finally, a message box notifies the user that the profiling is complete.
How to Use the Tool
- Open Excel and press Alt + F11 to open the VBA editor.
- Insert a new module (Insert > Module).
- Paste the code provided above into the module.
- Press F5 or run the DataProfiling macro to start the profiling.
- Follow the prompts and select the worksheet to analyze.
- Check the « Data Profiling » sheet for the results.
Enhancements and Customizations
- Additional Statistics: You can add more advanced statistical measures, such as standard deviation, median, or mode.
- Data Type Checking: The current version only checks for numerical data. You can extend this to check for dates or text patterns (e.g., using regular expressions for email addresses or phone numbers).
- Data Visualization: You could create charts or conditional formatting to highlight problematic data (e.g., high percentages of missing or duplicate values).