Finance

Charts

Statistics

Macros

Search

Develop Customized Data Profiling Tools with Excel VBA

What is Data Profiling?

Data profiling refers to the process of examining the data available in an existing data source (e.g., an Excel sheet) and summarizing its characteristics. The goal is to understand the structure, quality, and content of the data. Common tasks in data profiling include checking for nulls, duplicates, data distribution, data types, and identifying unusual values or outliers.

Objective

We will develop a customized data profiling tool in Excel using VBA. The tool will:

  1. Analyze data in an Excel worksheet.
  2. Identify common data profiling metrics such as:
    • Missing or blank values
    • Duplicates
    • Data type mismatches
    • Basic statistics (e.g., Min, Max, Average, Count)
  3. Present the results in a new worksheet for easy review.

VBA Code for Data Profiling Tool

Step 1: Create a New VBA Module

To start, press Alt + F11 to open the Visual Basic for Applications (VBA) editor. Then go to Insert > Module to create a new module.

Step 2: Write the Data Profiling Code

Paste the following VBA code into the module.

Sub DataProfiling()
    Dim ws As Worksheet
    Dim profilingWs As Worksheet
    Dim rng As Range
    Dim rowCount As Lon
    Dim colCount As Long
    Dim colIndex As Integer
    Dim cell As Range
    Dim dataDict As Object
    Dim value As Variant
    Dim emptyCount As Long
    Dim duplicateCount As Long
    Dim minVal As Variant
    Dim maxVal As Variant
    Dim avgVal As Double
    Dim countVal As Long
    Dim errorMsg As String   
    ' Initialize dictionary to track column data
    Set dataDict = CreateObject("Scripting.Dictionary")
    ' Prompt user to select the target sheet for profiling
    On Error Resume Next
    Set ws = Application.InputBox("Select a Worksheet to Profile:", Type:=8)
    On Error GoTo 0  
    ' Check if worksheet is selected
    If ws Is Nothing Then
        MsgBox "No sheet selected. Exiting profiling tool.", vbExclamation
        Exit Sub
    End If   
    ' Create a new worksheet for profiling results
    Set profilingWs = ThisWorkbook.Sheets.Add
    profilingWs.Name = "Data Profiling"   
    ' Write header in profiling sheet
    profilingWs.Cells(1, 1).Value = "Column Name"
    profilingWs.Cells(1, 2).Value = "Missing Values"
    profilingWs.Cells(1, 3).Value = "Duplicate Values"
    profilingWs.Cells(1, 4).Value = "Min Value"
    profilingWs.Cells(1, 5).Value = "Max Value"
    profilingWs.Cells(1, 6).Value = "Average Value"
    profilingWs.Cells(1, 7).Value = "Value Count"   
    ' Get the range of the data (non-empty cells)
    Set rng = ws.UsedRange
    rowCount = rng.Rows.Count
    colCount = rng.Columns.Count   
    ' Loop through each column to collect profiling data
    For colIndex = 1 To colCount
        emptyCount = 0
        duplicateCount = 0
        minVal = ""
        maxVal = ""
        avgVal = 0
        countVal = 0       
        ' Initialize dictionary to track duplicates
        Set dataDict = CreateObject("Scripting.Dictionary")       
        ' Loop through each row in the column
        For Each cell In rng.Columns(colIndex).Cells
            ' Check for missing (empty) values
            If IsEmpty(cell.Value) Then
                emptyCount = emptyCount + 1
            Else
                ' Track values for duplicates
                If dataDict.Exists(cell.Value) Then
                    duplicateCount = duplicateCount + 1
                Else
                    dataDict.Add cell.Value, Nothing
                End If               
                ' Track min, max, and average for numerical data
                If IsNumeric(cell.Value) Then
                    If minVal = "" Or cell.Value < minVal Then minVal = cell.Value
                    If maxVal = "" Or cell.Value > maxVal Then maxVal = cell.Value
                    avgVal = avgVal + cell.Value
                    countVal = countVal + 1
                End If
            End If
        Next cell       
        ' Calculate average if there are numeric values
        If countVal > 0 Then avgVal = avgVal / countVal       
        ' Output the profiling data into the profiling sheet
        profilingWs.Cells(colIndex + 1, 1).Value = rng.Cells(1, colIndex).Value
        profilingWs.Cells(colIndex + 1, 2).Value = emptyCount
        profilingWs.Cells(colIndex + 1, 3).Value = duplicateCount
        profilingWs.Cells(colIndex + 1, 4).Value = minVal
        profilingWs.Cells(colIndex + 1, 5).Value = maxVal
        profilingWs.Cells(colIndex + 1, 6).Value = avgVal
        profilingWs.Cells(colIndex + 1, 7).Value = countVal
    Next colIndex   
    ' Auto-fit columns for better readability
    profilingWs.Columns("A:G").AutoFit  
    ' Notify user that profiling is complete
    MsgBox "Data profiling complete! Check the 'Data Profiling' worksheet for results.", vbInformation
End Sub

Explanation of the Code

  1. Creating the Profiling Sheet
  • The code begins by prompting the user to select a worksheet to profile using an input box. If no worksheet is selected, it exits the macro.
  • A new worksheet, « Data Profiling, » is created to store the results of the profiling.
  1. Profiling Each Column
  • The code loops through each column of the selected worksheet and checks for:
    • Missing values (Empty cells): This is done using IsEmpty(cell.Value).
    • Duplicate values: A dictionary (dataDict) is used to track the values that have already been encountered.
    • Numerical analysis (only for numeric data): It calculates the minimum, maximum, and average values, as well as the count of numeric values.
  1. Writing the Profiling Results
  • The results for each column are written in the new « Data Profiling » sheet. The profiling data includes:
    • Column name
    • Missing values
    • Duplicate values
    • Minimum value
    • Maximum value
    • Average value
    • Value count
  1. Formatting and Finalization
  • Once the profiling is done, the columns in the « Data Profiling » sheet are auto-fitted for readability.
  • Finally, a message box notifies the user that the profiling is complete.

How to Use the Tool

  1. Open Excel and press Alt + F11 to open the VBA editor.
  2. Insert a new module (Insert > Module).
  3. Paste the code provided above into the module.
  4. Press F5 or run the DataProfiling macro to start the profiling.
  5. Follow the prompts and select the worksheet to analyze.
  6. Check the « Data Profiling » sheet for the results.

Enhancements and Customizations

  • Additional Statistics: You can add more advanced statistical measures, such as standard deviation, median, or mode.
  • Data Type Checking: The current version only checks for numerical data. You can extend this to check for dates or text patterns (e.g., using regular expressions for email addresses or phone numbers).
  • Data Visualization: You could create charts or conditional formatting to highlight problematic data (e.g., high percentages of missing or duplicate values).
0 0 votes
Évaluation de l'article
S’abonner
Notification pour
guest
0 Commentaires
Le plus ancien
Le plus récent Le plus populaire
Online comments
Show all comments
Facebook
Twitter
LinkedIn
WhatsApp
Email
Print
0
We’d love to hear your thoughts — please leave a commentx