How to Create BigQuery User-Defined Functions (UDFs)

Master creating custom logic in BigQuery with user-defined functions. This hands-on tutorial shows you how to build SQL and JavaScript UDFs for complex data transformations.

When working with BigQuery, you'll eventually encounter data processing scenarios that go beyond what standard SQL functions can handle. This tutorial will show you how to create BigQuery user-defined functions (UDFs) to implement custom logic that extends SQL's capabilities. By the end of this guide, you'll be able to write both SQL-based and JavaScript-based UDFs that solve complex data transformation challenges in Google Cloud.

User-defined functions are particularly valuable for Professional Data Engineer exam preparation because they represent a key skill for implementing advanced data processing pipelines in GCP. Understanding how to create BigQuery user-defined functions allows you to handle specialized business logic, perform complex string manipulations, and encapsulate reusable code that can be shared across multiple queries.

What You'll Accomplish

In this tutorial, you'll create several types of user-defined functions in BigQuery, starting with simple examples and progressing to real-world applications. You'll learn to build JavaScript UDFs for complex text processing, SQL UDFs for reusable calculations, and persistent UDFs that can be called across multiple queries and users. By the time you complete these steps, you'll have practical experience implementing custom logic that enhances your Google Cloud data warehouse capabilities.

Prerequisites and Requirements

Before you begin, ensure you have a Google Cloud account with an active project and the BigQuery API enabled in that project. You'll also need the BigQuery Data Editor or BigQuery Admin role, plus access to the BigQuery console or the Google Cloud CLI (which includes the bq command-line tool) installed locally. A basic understanding of SQL syntax and query structure is required; familiarity with JavaScript is helpful but not necessary.

Estimated time to complete: 45 minutes

Understanding BigQuery UDF Types

Before diving into implementation, you should understand the two primary types of user-defined functions available in BigQuery. SQL UDFs let you encapsulate SQL expressions into reusable functions, while JavaScript UDFs provide more powerful capabilities for complex logic that would be difficult or impossible to express in SQL alone.

SQL-based functions execute faster because they run natively within BigQuery's execution engine. JavaScript functions offer greater flexibility for implementing complex algorithms, working with JSON structures, or applying custom business logic. The Google Cloud platform supports both temporary UDFs that exist only for the duration of a single query and persistent UDFs stored in datasets for repeated use.

Step 1: Create Your First Temporary JavaScript UDF

Let's start by creating a simple JavaScript UDF that demonstrates the basic syntax. This temporary function will only exist during the execution of the query where it's defined.

Open the BigQuery console and create a new query. Enter the following code:

CREATE TEMP FUNCTION custom_greeting(name STRING)
RETURNS STRING
LANGUAGE js AS r"""
  return 'Hello, ' + name + '!';
""";

SELECT custom_greeting('Kahlil') AS greeting;

Click "Run" to execute the query. You should see a result with one row containing "Hello, Kahlil!" in the greeting column.

Here's what each part does. The CREATE TEMP FUNCTION statement defines a temporary function that exists only for this query. The function name is custom_greeting and it accepts one parameter called name of type STRING. The RETURNS STRING clause specifies what data type the function returns. The LANGUAGE js tells BigQuery this is a JavaScript function. The JavaScript code itself goes between the triple quotes, where we concatenate strings to create a personalized greeting.

Step 2: Build a SQL-Based UDF for Calculations

Now let's create a SQL UDF that performs a calculation. SQL UDFs are ideal when your logic can be expressed using standard SQL operations. They typically execute faster than JavaScript UDFs because they are evaluated natively by BigQuery's execution engine rather than in a separate JavaScript runtime.

Create a new query with this code:

CREATE TEMP FUNCTION calculate_margin(revenue FLOAT64, cost FLOAT64)
RETURNS FLOAT64 AS (
  ROUND((revenue - cost) / revenue * 100, 2)
);

SELECT 
  'Product A' AS product_name,
  150.0 AS revenue,
  90.0 AS cost,
  calculate_margin(150.0, 90.0) AS margin_percentage;

This query creates a function that calculates profit margin as a percentage. The function takes revenue and cost as inputs, computes the margin, and rounds the result to two decimal places. The SELECT statement demonstrates calling the function with sample data.

Execute the query and verify that the margin_percentage column shows 40.0, representing a 40% profit margin.
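One edge case worth noting: the function divides by revenue, so a revenue of zero causes a division-by-zero error at query time. A variant (the name calculate_margin_safe is illustrative) can use the built-in SAFE_DIVIDE function, which returns NULL instead of erroring:

```sql
CREATE TEMP FUNCTION calculate_margin_safe(revenue FLOAT64, cost FLOAT64)
RETURNS FLOAT64 AS (
  -- SAFE_DIVIDE returns NULL when revenue is 0, so the whole expression becomes NULL
  ROUND(SAFE_DIVIDE(revenue - cost, revenue) * 100, 2)
);

SELECT calculate_margin_safe(0.0, 90.0) AS margin_percentage;
```

Returning NULL for zero-revenue rows lets downstream queries filter or impute those cases explicitly rather than failing the entire query.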

Step 3: Create a Persistent UDF in a Dataset

Temporary UDFs are useful for one-time operations, but persistent UDFs can be saved in a dataset and reused across multiple queries by different users. This makes them valuable for standardizing business logic across your organization's data pipelines in GCP.

First, ensure you have a dataset where you can create the function. If you need to create one, run:

bq mk --dataset --location=US my_functions

Now create a persistent UDF in that dataset:

CREATE OR REPLACE FUNCTION `your-project-id.my_functions.standardize_phone`(phone STRING)
RETURNS STRING
LANGUAGE js AS r"""
  if (!phone) return null;
  // Remove all non-numeric characters
  var cleaned = phone.replace(/\D/g, '');
  // Format as (XXX) XXX-XXXX if 10 digits
  if (cleaned.length === 10) {
    return '(' + cleaned.substring(0,3) + ') ' + 
           cleaned.substring(3,6) + '-' + cleaned.substring(6);
  }
  return cleaned;
""";

Replace your-project-id with your actual GCP project ID. This function standardizes phone numbers into a consistent format, removing special characters and applying standard US phone number formatting.

After creating the function, you can call it from any query:

SELECT 
  `your-project-id.my_functions.standardize_phone`('555-123-4567') AS formatted_phone;

The result should be "(555) 123-4567".
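Because the function body is plain JavaScript, you can also test the logic locally (for example with Node.js) before deploying it to BigQuery. This standalone sketch mirrors the UDF body above; the camelCase name is just a local convenience:

```javascript
// Standalone mirror of the standardize_phone UDF body, for local testing.
function standardizePhone(phone) {
  if (!phone) return null;
  // Remove all non-numeric characters
  var cleaned = phone.replace(/\D/g, '');
  // Format as (XXX) XXX-XXXX if 10 digits
  if (cleaned.length === 10) {
    return '(' + cleaned.substring(0, 3) + ') ' +
           cleaned.substring(3, 6) + '-' + cleaned.substring(6);
  }
  return cleaned;
}

console.log(standardizePhone('555.123.4567')); // (555) 123-4567
console.log(standardizePhone('12345'));        // 12345 (too short to format)
```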

Step 4: Build a JavaScript UDF for JSON Processing

JavaScript UDFs work well with JSON data structures. This example shows how to create a function that extracts specific fields from JSON strings, which is common when dealing with semi-structured data in BigQuery.

CREATE TEMP FUNCTION extract_user_tier(user_metadata STRING)
RETURNS STRING
LANGUAGE js AS r"""
  try {
    var data = JSON.parse(user_metadata);
    if (data.subscription && data.subscription.tier) {
      return data.subscription.tier.toUpperCase();
    }
    return 'UNKNOWN';
  } catch (e) {
    return 'ERROR';
  }
""";

WITH sample_data AS (
  SELECT '{"subscription": {"tier": "premium", "active": true}}' AS metadata
  UNION ALL
  SELECT '{"subscription": {"tier": "basic", "active": true}}'
  UNION ALL
  SELECT '{"invalid_json"'
)
SELECT 
  metadata,
  extract_user_tier(metadata) AS user_tier
FROM sample_data;

This function parses JSON strings and extracts the subscription tier, handling errors gracefully. The query demonstrates the function with three test cases, including valid JSON with different tier values and invalid JSON to test error handling.
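As with the phone function, the parsing logic can be exercised outside BigQuery. This standalone mirror of the UDF body runs under Node.js and reproduces the cases from the sample data:

```javascript
// Standalone mirror of the extract_user_tier UDF body, for local testing.
function extractUserTier(userMetadata) {
  try {
    var data = JSON.parse(userMetadata);
    if (data.subscription && data.subscription.tier) {
      return data.subscription.tier.toUpperCase();
    }
    return 'UNKNOWN';
  } catch (e) {
    return 'ERROR';
  }
}

console.log(extractUserTier('{"subscription": {"tier": "premium", "active": true}}')); // PREMIUM
console.log(extractUserTier('{"active": true}')); // UNKNOWN (no subscription field)
console.log(extractUserTier('{"invalid_json"'));  // ERROR (parse failure)
```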

Verification and Testing

After creating each UDF, you should verify it works correctly with different inputs. Test edge cases like NULL values, empty strings, and unexpected data types. For the phone standardization function, try these test cases:

SELECT 
  input_phone,
  `your-project-id.my_functions.standardize_phone`(input_phone) AS formatted
FROM UNNEST([
  '5551234567',
  '(555) 123-4567',
  '555.123.4567',
  NULL,
  '12345'
]) AS input_phone;

Each input should produce the expected output format or handle the edge case appropriately.

Real-World Application: Industry-Specific Examples

Healthcare: Patient Record Anonymization

A hospital network needs to anonymize patient identifiers for research datasets while maintaining consistent pseudonyms. Here's a UDF that creates deterministic hashes:

CREATE OR REPLACE FUNCTION `healthcare-project.analytics.anonymize_patient_id`(patient_id STRING, salt STRING)
RETURNS STRING
LANGUAGE js AS r"""
  function hashString(str) {
    var hash = 0;
    for (var i = 0; i < str.length; i++) {
      var char = str.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash;
    }
    return Math.abs(hash).toString(36).toUpperCase().padStart(10, '0');
  }
  return 'ANON_' + hashString(patient_id + salt);
""";

SELECT 
  `healthcare-project.analytics.anonymize_patient_id`(patient_id, 'secure_salt_2024') AS anonymous_id,
  diagnosis_code,
  treatment_date
FROM `healthcare-project.records.patient_visits`;
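Pseudonyms must be deterministic so that the same patient always maps to the same anonymous ID across research datasets. This standalone mirror of the UDF body lets you confirm that property locally; note it is a simple non-cryptographic hash, so for stronger guarantees you might use BigQuery's built-in SHA256 function instead:

```javascript
// Standalone mirror of the anonymize_patient_id UDF body, for local testing.
function hashString(str) {
  var hash = 0;
  for (var i = 0; i < str.length; i++) {
    var char = str.charCodeAt(i);
    hash = ((hash << 5) - hash) + char;
    hash = hash & hash; // force 32-bit integer overflow semantics
  }
  // Base-36 encode and zero-pad to a fixed width of 10 characters
  return Math.abs(hash).toString(36).toUpperCase().padStart(10, '0');
}

function anonymizePatientId(patientId, salt) {
  return 'ANON_' + hashString(patientId + salt);
}

console.log(anonymizePatientId('PAT-001', 'secure_salt_2024'));
```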

Telecommunications: Call Duration Categorization

A mobile carrier needs to categorize call durations into billing tiers for analytics. This SQL UDF implements the business logic:

CREATE OR REPLACE FUNCTION `telecom-project.billing.categorize_call_duration`(seconds INT64)
RETURNS STRING AS (
  CASE 
    WHEN seconds <= 60 THEN 'SHORT'
    WHEN seconds <= 300 THEN 'MEDIUM'
    WHEN seconds <= 900 THEN 'LONG'
    ELSE 'EXTENDED'
  END
);

SELECT 
  call_id,
  duration_seconds,
  `telecom-project.billing.categorize_call_duration`(duration_seconds) AS duration_tier,
  rate_per_minute
FROM `telecom-project.calls.call_records`
WHERE DATE(call_timestamp) = CURRENT_DATE();

Agriculture: Sensor Data Validation

An agricultural monitoring platform collects soil moisture readings from IoT sensors. This UDF validates readings and flags anomalies:

CREATE OR REPLACE FUNCTION `agritech-project.sensors.validate_moisture_reading`(
  reading FLOAT64, 
  sensor_type STRING
)
RETURNS STRUCT<is_valid BOOL, flag STRING>
LANGUAGE js AS r"""
  var result = {is_valid: true, flag: 'NORMAL'};
  
  if (reading === null || reading < 0 || reading > 100) {
    result.is_valid = false;
    result.flag = 'OUT_OF_RANGE';
    return result;
  }
  
  if (sensor_type === 'capacitive' && reading > 90) {
    result.flag = 'POSSIBLE_FLOODING';
  } else if (sensor_type === 'capacitive' && reading < 10) {
    result.flag = 'DROUGHT_CONDITION';
  }
  
  return result;
""";

SELECT 
  sensor_id,
  timestamp,
  moisture_level,
  `agritech-project.sensors.validate_moisture_reading`(moisture_level, sensor_type).*
FROM `agritech-project.iot.sensor_readings`
WHERE DATE(timestamp) = CURRENT_DATE();

Common Issues and Troubleshooting

JavaScript Syntax Errors

If you see "Invalid function body" errors, check that your JavaScript code is syntactically correct. The error message often indicates the line number. Remember that JavaScript in UDFs has some limitations compared to standard JavaScript. For example, you can't use external libraries or make network calls.

Type Mismatch Errors

BigQuery is strict about data types. If you pass an INT64 to a function expecting STRING, you'll get an error. Always cast your inputs explicitly:

SELECT my_function(CAST(numeric_column AS STRING)) FROM table;

Null Handling

UDFs must handle NULL inputs gracefully. JavaScript UDFs receive NULL as null, while SQL UDFs can use IS NULL checks. Always test your functions with NULL values:

SELECT my_function(NULL) AS result;

Performance Issues with JavaScript UDFs

JavaScript UDFs are slower than native SQL or SQL UDFs because they require additional processing overhead. If your UDF is causing slow queries, consider whether you can rewrite the logic in SQL. Use the query execution plan in the BigQuery console to identify performance bottlenecks.
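As a concrete example of such a rewrite, the JavaScript greeting function from Step 1 can be expressed as a pure SQL UDF using CONCAT, avoiding the JavaScript runtime entirely:

```sql
CREATE TEMP FUNCTION custom_greeting(name STRING)
RETURNS STRING AS (
  CONCAT('Hello, ', name, '!')
);

SELECT custom_greeting('Kahlil') AS greeting;
```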

Integration with Other GCP Services

BigQuery UDFs work well with other Google Cloud services. When building data pipelines with Cloud Dataflow, you can call BigQuery UDFs in your destination queries to transform data as it lands. Cloud Functions can trigger BigQuery queries that use UDFs to process data in response to events.

For example, a Cloud Composer workflow might orchestrate multiple BigQuery jobs that use shared UDFs for consistent data transformation across different pipeline stages. You can also reference persistent UDFs in standard (logical) views to bake custom logic into reusable query layers; note, however, that materialized views do not support UDFs, so pre-computed results that depend on custom logic are better maintained with scheduled queries writing to regular tables.

When working with Cloud Storage, you might use BigQuery's external table functionality combined with UDFs to process files directly. This pattern is common when dealing with logs or semi-structured data stored in GCS buckets.

Best Practices and Recommendations

Organize Functions by Purpose

Create separate datasets for different categories of UDFs: one for data quality functions, another for business logic, and another for formatting utilities. This organizational structure makes functions easier to discover and maintain across your GCP project.

Document Your Functions

BigQuery lets you attach a short description to a persistent UDF with OPTIONS (description = '...') in the CREATE FUNCTION statement, but for richer documentation maintain external notes that describe each function's purpose, parameters, return values, and examples. Consider creating a table in BigQuery that catalogs all your functions:

CREATE OR REPLACE TABLE `your-project-id.metadata.udf_catalog` (
  function_name STRING,
  dataset STRING,
  description STRING,
  parameters STRING,
  example_usage STRING,
  created_by STRING,
  created_date DATE
);
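A catalog entry for the phone function from Step 3 might look like this (all values are illustrative):

```sql
INSERT INTO `your-project-id.metadata.udf_catalog` VALUES (
  'standardize_phone',
  'my_functions',
  'Formats US phone numbers as (XXX) XXX-XXXX',
  'phone STRING',
  "SELECT `your-project-id.my_functions.standardize_phone`('5551234567')",
  'data-team',
  CURRENT_DATE()
);
```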

Version Control for Production UDFs

Store your UDF definitions in version control systems such as GitHub or GitLab. This practice allows you to track changes, review modifications, and roll back if needed. Include UDF creation statements in your infrastructure as code deployments.

Security and Access Control

Control access to UDFs using BigQuery dataset permissions. Users need appropriate access to both the UDF and any tables referenced within it. Be cautious with JavaScript UDFs that process sensitive data, as the function logic is visible to anyone with access to view the function definition.

Cost Considerations

JavaScript UDFs consume more slot time than equivalent SQL operations. For high-volume queries, this can impact costs. Monitor your query costs in the Google Cloud console and optimize expensive UDFs by converting them to SQL when possible.

Testing and Validation

Create comprehensive test suites for your UDFs before deploying them to production. Use BigQuery's scripting capabilities to automate testing:

DECLARE test_passed BOOL DEFAULT TRUE;

SET test_passed = (
  SELECT COUNT(*) = 0
  FROM (
    SELECT `project.dataset.my_function`(test_input) AS result, expected_output
    FROM `project.dataset.function_tests`
  )
  WHERE result != expected_output
);

IF NOT test_passed THEN
  RAISE USING MESSAGE = 'Function tests failed';
END IF;

Next Steps and Enhancements

Now that you can create BigQuery user-defined functions, consider these advanced topics. Explore parameterized UDFs that accept multiple complex data types like ARRAY or STRUCT. Learn to create aggregate UDFs that work with GROUP BY clauses to implement custom aggregation logic.

Investigate loading external JavaScript libraries into your JavaScript UDFs with the library option, which references library files stored in Cloud Storage. Study how UDFs interact with BigQuery ML models, allowing you to create custom preprocessing functions for machine learning workflows in Google Cloud.

Consider building a library of reusable UDFs that standardize common operations across your organization. This approach promotes consistency and reduces code duplication across multiple data pipelines and analytics queries in your GCP environment.

Summary

You've now learned how to create BigQuery user-defined functions using both SQL and JavaScript, from simple temporary functions to persistent, production-ready implementations. You can build custom logic for data transformations, implement business rules, and extend BigQuery's capabilities to meet specific requirements that standard SQL functions can't address.

These skills are fundamental for building sophisticated data processing pipelines in Google Cloud and represent important knowledge for data engineering roles. Understanding how to create BigQuery user-defined functions enables you to implement complex transformations, maintain consistent business logic, and optimize query reusability across your organization's analytics platform.

Readers preparing for the certification exam can check out the Professional Data Engineer course, which covers UDFs alongside other advanced BigQuery features and GCP data engineering concepts.