Glossary

What are user-defined functions?

Software systems with limited native capabilities can leave users without important functionality. User-defined functions (UDFs) can be used to extend a system's capabilities in various contexts: from general programming languages to databases and big data platforms. Wherever there's a need to incorporate custom logic into an existing system, UDFs can provide that flexibility.

In the context of databases, a UDF is additional code that extends the database server’s functionality by adding a new custom function that can be called in queries. Take, for example, a relational database without a built-in function to convert temperatures from Celsius to Fahrenheit; a user could define this function and then use it in SQL queries as if it were built-in. 
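To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module, which lets you register an ordinary Python function as a scalar UDF callable from SQL. The function and table names here are our own illustrations, not part of any standard:

```python
import sqlite3

def c_to_f(celsius):
    """Scalar UDF: convert a Celsius reading to Fahrenheit."""
    if celsius is None:          # be defensive about SQL NULLs
        return None
    return celsius * 9.0 / 5.0 + 32.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("Oslo", 0.0), ("Cairo", 30.0)])

# Register the Python function so SQL queries can call it by name.
conn.create_function("c_to_f", 1, c_to_f)

rows = conn.execute(
    "SELECT city, c_to_f(temp_c) FROM readings ORDER BY city"
).fetchall()
print(rows)  # [('Cairo', 86.0), ('Oslo', 32.0)]
```

Once registered, c_to_f behaves in queries exactly like a built-in function, which is the essence of a UDF.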

The SQL standard distinguishes different types of UDFs: user-defined scalar functions, which take input arguments and return a single scalar value (like our temperature conversion example), and user-defined table functions, which return an entire table (multiple rows and columns).

It’s important to note that UDFs are distinct from stored procedures, even though both involve user-defined logic on the database server. A stored procedure is usually a collection of native SQL statements that perform a sequence of operations (possibly affecting multiple records or tables) and is invoked as a whole. A stored procedure cannot be used inline within a query’s expressions. A UDF, on the other hand, is treated more like an expression and can be used within a SELECT, WHERE, or other clauses just like any built-in function. 

Stored procedures are called to execute entire routines. UDFs are typically called to compute a value (or a table) that can integrate into a larger query. This distinction means UDFs are often subject to certain restrictions (for example, many SQL databases disallow UDFs from producing side effects, such as data modification) to ensure they behave like pure functions within a query.


Applications of user-defined functions

User-defined functions were born from the need for customization. When built-in functionality isn’t enough, UDFs can be applied across various systems, from traditional relational databases to modern NoSQL and big data platforms.

Relational databases

Developers and database administrators (DBAs) can use UDFs to extend SQL with custom computations in relational databases. After defining a function with a command like CREATE FUNCTION, they can invoke it in queries as if it were a native function.

This can be useful for encapsulating complex calculations or business rules. If an application frequently needs to calculate a custom metric or transform data in a specific way, writing a UDF lets the logic be written once and reused across many queries.

Relational databases support different kinds of UDFs. A scalar UDF returns a single value and can be used anywhere a value is expected (for example, to compute a formatted string or a mathematical result for each row). A table-valued UDF returns an entire table result that can be joined or queried as if it were a table; this is useful for generating a set of records from input parameters. Modern SQL dialects often support both scalar and table UDFs, and some even allow multi-statement UDFs (which allow procedural logic to be written inside the function to derive the result).

To give a real-world example, take Microsoft SQL Server, where UDFs can be scalar or table-valued and enable modularization of logic. You might create a function definition to standardize a phone number format, then use it in a SELECT query to format phone numbers for all customers. Other SQL systems like PostgreSQL, Oracle, DB2, and MySQL all support user-defined functions with similar concepts. They even allow writing UDFs in different languages: SQL Server lets you create UDFs in .NET languages like C#, and Oracle lets you use Java or C for UDFs. With this flexibility, a developer can choose a language that's best suited for the specific task or one they're comfortable with, while the database ensures the function can be called from SQL.
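The phone-number scenario can be sketched in miniature with Python's sqlite3 module. The format_phone function, its formatting rule, and the customers table are hypothetical illustrations, not SQL Server's actual API:

```python
import re
import sqlite3

def format_phone(raw):
    """Scalar UDF: normalize a US-style phone number to (XXX) XXX-XXXX."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", raw)   # strip everything but digits
    if len(digits) != 10:
        return raw                    # leave values we can't normalize untouched
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ada", "555.123.4567"), ("Lin", "5559876543")])
conn.create_function("format_phone", 1, format_phone)

# The UDF is used inline in the SELECT, like any built-in function.
rows = conn.execute(
    "SELECT name, format_phone(phone) FROM customers ORDER BY name"
).fetchall()
```

The formatting rule is defined once, and every query that calls format_phone gets the same behavior.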

It's common to use UDFs in relational databases to implement logic that would be cumbersome or impossible to express in a single SQL query. By writing a UDF, the complex logic can be tucked behind a function name. If multiple queries or applications need the same calculation, a UDF also ensures reusability and consistency, since they all use identical logic defined in one place. And because the UDF runs on the server, it can reduce the amount of data that needs to be sent to the client or application: instead of fetching raw data and processing it externally, the database can do the calculation through the UDF and return just the final result.

It’s worth noting that there are some limitations and best practices. For example, many relational databases treat UDFs as black boxes for optimization; the query planner might not optimize across the function boundary. In some systems, a scalar UDF will be executed for every row, which can be slower than an equivalent set-based operation. There have been improvements over time (some databases can inline or optimize certain UDFs), but users should be mindful when using them heavily in large queries.

NoSQL and key-value databases

UDFs aren’t limited to traditional SQL databases. Many NoSQL databases and key-value stores also offer similar “user-defined logic” capabilities, though they might be implemented differently. A high-performance NoSQL database like Aerospike allows users to write UDFs in Lua that run on the database servers. In Aerospike, there are two flavors: record UDFs, which execute on a single data record (for example, to apply a transformation or update to that record), and stream UDFs, which operate on a set of records for aggregation or filtering purposes.

Stream UDFs essentially provide a form of parallel, in-database MapReduce: you can send a query to all nodes, each of which runs the UDF on its local data (for example, counting or summing values), and then combine the results. Unlike batch MapReduce jobs, these are designed to be low-latency in a shared-nothing cluster architecture. The ability to execute such logic directly where the data lives is crucial for real-time analytics and fast computations without extracting all the data.
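Aerospike stream UDFs are written in Lua, but the map-then-reduce shape can be sketched in plain Python: each simulated "node" computes a partial result over its local partition, and the partials are then combined. The partitioning and thread pool here are illustrative only, not Aerospike's actual execution model:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Three simulated "nodes", each holding a local partition of records.
partitions = [
    [{"amount": 10}, {"amount": 25}],
    [{"amount": 5}],
    [{"amount": 40}, {"amount": 20}],
]

def map_partition(records):
    """Runs where the data lives: produce a partial sum per node."""
    return sum(r["amount"] for r in records)

def reduce_results(a, b):
    """Combine partial results from the nodes into the final answer."""
    return a + b

# Each partition is processed in parallel; only small partials travel back.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_partition, partitions))

total = reduce(reduce_results, partials)
print(total)  # 100
```

Only the small partial sums cross the "network" here, which is the whole point of pushing the computation to the data.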

Big data processing frameworks

Beyond traditional databases, UDFs are prevalent in big data and analytics frameworks. Systems like Apache Hive (a SQL-like query engine on Hadoop), Apache Spark, and modern cloud data warehouses all support user-defined functions to extend their data processing languages.

Hive and SQL-on-Hadoop

Apache Hive introduced the concept of UDFs to the Hadoop ecosystem. Hive’s query language (HiveQL) is similar to SQL, and it allows developers to register UDFs written in Java to perform custom computations in queries. Hive even categorizes UDFs further: there are regular UDFs for scalar functions; user-defined aggregate functions (UDAFs) for defining new aggregations (like computing a custom average or statistical measure over groups of rows); and user-defined table-generating functions (UDTFs) for turning a single row into multiple rows (like a split function that outputs a table).
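Hive UDAFs are written as Java classes, but the same accumulate-then-finalize shape exists in Python's sqlite3 module, where a class with step() and finalize() methods registers as a user-defined aggregate. The geometric-mean aggregate below is our own example, not a Hive API:

```python
import math
import sqlite3

class GeometricMean:
    """Aggregate UDF: geometric mean of the positive values in a group."""
    def __init__(self):
        self.log_sum = 0.0
        self.count = 0

    def step(self, value):      # called once per input row in the group
        if value is not None and value > 0:
            self.log_sum += math.log(value)
            self.count += 1

    def finalize(self):         # called once per group to produce the result
        return math.exp(self.log_sum / self.count) if self.count else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (grp TEXT, v REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("a", 2.0), ("a", 8.0), ("b", 3.0)])
conn.create_aggregate("geomean", 1, GeometricMean)

rows = conn.execute(
    "SELECT grp, geomean(v) FROM samples GROUP BY grp ORDER BY grp"
).fetchall()
```

Whatever the host system, a UDAF boils down to the same contract: accumulate state per row, then emit one value per group.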

This extensibility was important because early big data use cases often had unique processing needs that weren’t met by the limited SQL-like features of Hive. By writing a Java UDF, developers can plug in their own logic while still leveraging the power of a distributed query engine to apply it to large datasets.

Spark and DataFrames

The general-purpose cluster computing framework Apache Spark also supports user-defined functions in its SQL and DataFrame APIs. You can create UDFs in Spark using languages like Scala, Java, Python, or R (depending on the preferred API) and apply them to DataFrame columns or SQL queries. If you have a complex formula to apply to each record, you can register a UDF in PySpark and use it in a select or where clause. 

However, the Spark community often cautions that UDFs may not be as efficient as Spark’s built-in functions. Built-in functions in Spark are optimized in Java/Scala and can take advantage of vectorized execution and other performance features, whereas a Python UDF, for instance, might force row-by-row processing. So Spark UDFs are used when necessary, for logic that can't be expressed with the high-level DataFrame operations or SQL, but users are encouraged to use built-in transformations when possible for performance reasons. Still, UDFs are invaluable for custom tasks like parsing a proprietary data format, implementing a non-standard calculation, or integrating with external libraries, all while staying in the realm of a big data query.
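The row-at-a-time overhead is easy to observe even outside Spark. The toy benchmark below (plain Python, not Spark) compares calling a Python function once per element against an equivalent rewrite that stays inside C-implemented built-ins:

```python
import timeit

data = list(range(100_000))

def add_one(x):                        # stand-in for a per-row Python UDF
    return x + 1

# UDF style: a Python function is invoked once per row.
udf_result = sum(map(add_one, data))

# Built-in style: the same answer, but every loop runs in C.
builtin_result = sum(data) + len(data)

t_udf = timeit.timeit(lambda: sum(map(add_one, data)), number=20)
t_builtin = timeit.timeit(lambda: sum(data) + len(data), number=20)
print(f"per-row UDF: {t_udf:.3f}s   built-in rewrite: {t_builtin:.3f}s")
```

The per-row function calls dominate the cost; the built-in rewrite produces the identical result far faster, which mirrors the Spark advice above.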

Cloud data warehouses and others

Modern cloud databases and warehouses (like Snowflake, Google BigQuery, and Amazon Redshift) also support user-defined functions. Many of these systems allow UDFs in SQL (sometimes even letting you write them in Python or JavaScript). 

Snowflake enables you to write JavaScript UDFs, and Google BigQuery supports SQL and JavaScript for UDFs. These are used to extend the analytic capabilities of these platforms; you might write a UDF for a complex statistical test or a custom string manipulation that isn’t built-in. One interesting trend is the use of secure, sandboxed environments for these UDFs (for example, some platforms offering “secure UDFs” that cannot access the external network or data, ensuring functions can’t accidentally leak data). Some analytical databases even support UDFs written in C++ for high-performance extensions.

Across big data and analytics tools, UDFs provide the flexibility to handle custom requirements. They allow these platforms to adapt to many domains because users can bring their own code into the execution engine. On the flip side, there’s a need to manage these functions (versioning, testing) and often a performance consideration: running custom code might be slower than native operations. 

Each platform has its own framework for registering and using UDFs, but the core idea remains: push the computation to the data rather than pulling the data out to a program. This theme of pushing computation to data is central to why UDFs are used.


Benefits of user-defined functions

Developers and data engineers reach for UDFs because they have several important benefits that make them a compelling feature in many systems.

Customizing and extending functionality

A UDF enables users to add new functionality to a system that doesn't provide exactly what they need, whether that's a mathematical formula, a string-processing routine, a data conversion, or any other domain-specific logic. This extensibility is crucial in complex applications; for example, a financial database may need a proprietary risk calculation formula. Once it's implemented as a UDF, analysts can call the formula in their queries just like any other function.

Code reuse and modularity

UDFs encourage more modular programming, especially in databases. Instead of repeating complex expressions across queries or applications, users can define the logic once as a function and reuse it. This leads to code that's easier to maintain: if the logic needs to change, you update the UDF in one place rather than editing many queries or application code. SQL Server’s documentation notes that users can create the function once and store it in the database, then call it any number of times. It can even be modified independently of the application’s source code. 

UDFs bring the software engineering principle of “write once, use anywhere” into the database world. Business rules encapsulated in UDFs ensure consistency; everyone calling the function gets the same result, using the same approved logic.

This modular approach also simplifies complex operations. A query or program becomes more readable when a chunk of logic is tucked behind a descriptive function name. It’s clearer to write SELECT ... WHERE isCustomerActive(user_status) than to inline the possibly messy logic every time. Function definitions keep business logic in one place, outside the main query, and (if written well) enable a "documentation-as-code" approach that lets other users understand what a function does and how it does it without mountains of additional docs. In short, UDFs allow for a more standardized, modular approach to data manipulation within databases.
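As a sketch of this readability win, here is the isCustomerActive idea expressed with Python's sqlite3 module; the status codes and the rule itself are invented for illustration:

```python
import sqlite3

ACTIVE_STATUSES = {"active", "trial"}   # hypothetical business rule

def is_customer_active(status):
    """Encapsulates the 'active customer' rule behind one name."""
    return 1 if status in ACTIVE_STATUSES else 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, user_status TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Ana", "active"), ("Bo", "churned"), ("Cy", "trial")])
conn.create_function("isCustomerActive", 1, is_customer_active)

# The query reads like the business question it answers.
active = conn.execute(
    "SELECT name FROM users WHERE isCustomerActive(user_status) ORDER BY name"
).fetchall()
```

If the definition of "active" changes, only the function body changes; every query calling it picks up the new rule.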

Performance and efficiency (through data locality) 

UDFs can reduce network traffic and latency by running custom logic on the server side (in the database or processing engine). Consider an application that needs to filter or aggregate data based on complex conditions across attributes. Without UDFs, the application might retrieve a large dataset from the database and then filter/aggregate in application code. This involves transferring a lot of data over the network. 

With a UDF, filtering or aggregation can occur within the database engine, and only the reduced results are returned. Executing custom logic (like filtering or aggregation) directly in the database can avoid multiple client-server round-trips. Running logic where the data lives reduces latency and saves network resources. In short, UDFs enable pushing computation to the data, which is usually more efficient than pulling data to the computation.

Some databases can even optimize repeated executions of UDFs. For example, SQL Server caches the execution plan for T-SQL UDFs after the first use, so subsequent calls avoid recompiling the logic, leading to faster execution on repeated runs. And as long as the UDF is deterministic (always produces the same output for the same input), some engines might cache results or make other optimizations. 
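Determinism can often be declared explicitly. In Python's sqlite3 module, for example, create_function accepts a deterministic flag (Python 3.8+ with SQLite 3.8.3+) that promises the same input always yields the same output, which allows SQLite to optimize calls and use the function in more contexts such as indexes:

```python
import sqlite3

def tax(amount):
    """Deterministic: the same input always yields the same output."""
    return round(amount * 1.25, 2)

conn = sqlite3.connect(":memory:")
# Declaring determinism lets the engine optimize repeated calls and
# permit the function in places that require stable results.
conn.create_function("tax", 1, tax, deterministic=True)

result = conn.execute("SELECT tax(100.0)").fetchone()[0]
```

Omitting the flag is always safe but forfeits those optimizations; claiming determinism for a function that isn't (say, one reading the clock) invites wrong results.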

There's also ongoing research and development in database systems to make UDFs faster, including techniques like inlining UDF code into the query plan or using just-in-time compilation. All these efforts aim to combine the flexibility of UDFs with the performance of native operations.

Another performance angle is parallelism. In distributed systems, UDFs can run in parallel across many nodes (for example, each data node running the UDF on its portion of data). This is how stream UDFs in Aerospike achieve real-time parallel processing. In big data platforms like Spark or Flink, if you use a UDF on a dataset, the framework will execute that UDF on partitions of data in parallel (though, as noted, certain types of UDFs can prevent other optimizations). The ability to process data in place and in parallel can speed up complex operations.

Simplifying application logic

A lot of heavy logic can be moved out of the application layer into the database or processing engine by utilizing UDFs. This not only benefits performance but also simplifies the application code. The database can be tasked with more of the “thinking,” so the client code can remain lean. For example, an application might call a stored procedure or query that uses UDFs and get back exactly what it needs, rather than getting raw data and then doing further computation. This can reduce the overall development effort and ensure that the logic runs in a controlled, optimized environment (the database).

Putting the logic in a UDF ensures that different applications or microservices sharing the same database rely on a single implementation; this mitigates duplicate logic and reduces the risk of potential inconsistencies. It also centralizes certain business rules or computations within the database layer. In analytics scenarios, data scientists can store UDFs in a centralized and accessible location so that other analysts can simply call those functions in SQL without needing to understand the underlying Python/R code, for instance.

Challenges and considerations of UDFs

While UDFs offer many benefits, they also come with their own challenges and limitations. It’s important to be aware of these when deciding to use UDFs so that you can apply best practices and avoid pitfalls.

Performance overhead

Ironically, the flexibility of UDFs can sometimes hurt performance if used improperly. One common issue in relational databases is that a UDF might be executed in a less optimized way than an equivalent set of SQL operations. For example, a naive scalar UDF in a SQL query might cause the database to evaluate the function for each row of a table, one by one, rather than using a set-oriented approach. Queries with heavy UDF usage can, as a result, become slow. 

Especially when dealing with large datasets, a heavy reliance on complex UDFs that process row-by-row can bring the system to a crawl. Using set-based queries instead may be a more performant option. The advice generally given is to use UDFs judiciously; don’t wrap every little operation in a UDF, especially if standard SQL or built-in functions can do the job efficiently.

Big data frameworks echo a similar caution. Databricks (Spark) documentation recommends using UDFs only for logic that is hard to express with built-in functions, precisely because the built-in functions are heavily optimized for distributed processing. For regularly run ETL jobs or very large data, a hand-built UDF (especially in an interpreted language like Python) might become a bottleneck. 

It may be better to find a vectorized operation or break the problem down using the platform’s native APIs. If you do need a UDF for complex tasks on big data, newer features like pandas UDFs (which allow vectorized operations in Spark using Python) or UDFs in compiled languages can help mitigate some overhead.

The good news is that database vendors are aware of these issues and have been improving UDF performance. Some modern SQL engines can inline certain UDFs (eliminating the function call overhead by merging the logic into the main query). Others avoid context-switching penalties by using the same execution engine for UDF code. 

When writing a UDF, one should consider its complexity and how often it will be called. Profiling and testing UDF-heavy workloads is important to ensure they meet performance requirements. If a particular UDF turns out to be a hotspot, sometimes the solution is to refactor the logic back into a set-based query or use an alternative approach (such as a stored procedure for batch processing).

Restricted side effects and safety

By design, many UDFs are restricted in what they can do. In relational databases, a UDF (especially a scalar UDF) is often not allowed to have side effects, meaning it shouldn’t permanently change the database state as part of computing a result. This is to keep them behaving like mathematical functions (so the optimizer and engine can handle them with predictability). 

SQL Server doesn't allow a UDF to insert or delete table rows as part of its execution. UDFs also can't call non-deterministic operations like NEWID() or getdate(), unless they’re marked appropriately. These restrictions can confuse new users, driving searches for “why can’t my function update a table?” But they exist to mitigate unexpected behavior and maintain the integrity of queries. If you need to do such actions, a stored procedure is usually more appropriate than a UDF.

From a safety perspective, running arbitrary user code on a database server raises concerns. A poorly written UDF could go into an infinite loop, consume excessive CPU or memory, or even try to access resources it shouldn’t. Therefore, databases often sandbox UDFs. Sandboxing means the UDF runs in a controlled environment where it can’t do certain things and will be stopped if it misbehaves. 

Aerospike, for example, runs Lua UDFs in-process for performance, but it has protections like timeouts to break out of any infinite loops or long-running execution. Many systems also limit resource usage for UDFs and isolate them from critical server internals. In some platforms, user-defined code runs in a separate process or container to isolate failure domains from the main database engine and add a layer of safety between UDFs and primary database processes. This is an area of active development, with emerging technologies being explored to support portable UDFs across databases.

The implication for users is that when writing a UDF, you often have a subset of capabilities available. For example, a UDF might not be able to make network calls or read from the file system. You should be mindful of the sandbox rules of your platform. Also, error handling in UDFs can be tricky: if a UDF runs into an error, it typically aborts the query or transaction that invoked it. This means you should write defensive code inside the UDF, checking for edge cases or bad input, to avoid exceptions that might fail larger operations.
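Defensive coding inside a UDF can look like the following sketch: rather than letting one malformed row raise an exception and abort the whole query, the function returns NULL for input it can't handle. The safe_parse_price function and orders table are hypothetical:

```python
import sqlite3

def safe_parse_price(text):
    """Return the price as a float, or NULL for malformed input,
    instead of raising and aborting the whole query."""
    try:
        return float(text.strip().lstrip("$"))
    except (AttributeError, ValueError):   # None, or unparseable text
        return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (raw_price TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [("$19.99",), ("n/a",), (None,)])
conn.create_function("safe_parse_price", 1, safe_parse_price)

rows = conn.execute("SELECT safe_parse_price(raw_price) FROM orders").fetchall()
```

The query completes over all rows, and downstream SQL can filter out the NULLs or count them as data-quality errors.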

Debugging and development complexity

Another challenge with UDFs is debugging them. When your code runs inside a database engine or distributed system, you may not have the same debugging tools available as you do in a normal application. Logging might be limited, error messages might be terse, and setting up a step-through debugger is usually not possible. 

Additionally, deploying changes to UDFs requires care. In a database cluster, a UDF definition might need to propagate to all nodes. Aerospike, for instance, has a mechanism to distribute the UDF code (the Lua script) to all nodes in the cluster when you register it. Until that distribution is done, you have to be careful about using the UDF, especially in a mixed-version environment. Versioning of UDFs (making sure new versions don’t break queries expecting old logic) is something to plan for; sometimes it’s solved by creating a new function name for a new version while retiring the old one later.

From a design standpoint, one must also consider maintainability. Scattering too much business logic into database UDFs can make an architecture hard to understand, especially if some logic is in application code and some in UDFs. It requires discipline to document and manage these functions as part of your overall codebase. Some organizations treat database code (UDFs, stored procedures, etc.) with the same rigor as application code, including it in source control, writing tests for it, and doing code reviews, which is a good practice to adopt.

When to use (or avoid) UDFs

Given the benefits and drawbacks, it’s worth summarizing when using a UDF is most appropriate. UDFs shine when you need to execute custom logic on the server side for efficiency or consistency, especially if that logic will be reused often. They are great for encapsulating complex calculations, performing filters or transformations to cut down data transfer, and extending query capabilities in a convenient way. Ad-hoc analyses and one-off data processing tasks can safely use UDFs to speed up development. For example, a data analyst might write a quick Python UDF in Spark to parse a new kind of data without waiting for the engineering team to add a native function.

On the other hand, if a particular operation expressed as a UDF is going to be applied to massive data on a routine basis (like a nightly ETL on billions of records), you should evaluate its performance carefully. It might be worth investing in a custom solution or leveraging existing features if possible. Also, if the logic involves heavy data manipulation across many rows, consider if a set-based approach (in SQL) or built-in aggregation might handle it more efficiently than a UDF. In systems where UDFs hinder optimizations (for example, some SQL engines where UDFs disable parallelism of a query), using them might drastically slow down an otherwise fast query.

Use UDFs for the power and flexibility they offer, but do so with awareness. They are like spice in a recipe: a little can enhance the flavor of your system, but too much can ruin the dish. Now, to wrap up our overview, let’s revisit how UDFs compare to the older concept of stored procedures, as this is a common point of confusion.


User-defined functions vs. stored procedures

UDFs and stored procedures are both ways to execute user-defined logic on a database server, but they serve different purposes. As mentioned earlier, a stored procedure is generally a batch of SQL statements (and possibly flow-of-control statements) that is stored under a name and executed as a whole. You might call a stored procedure to perform a series of tasks, such as updating several tables and producing a report. Stored procedures can accept input parameters and return output (either via output parameters or result sets), but they are not typically used within other SQL statements. You invoke a stored procedure with a command (like EXEC proc_name(...)), not as part of a SELECT or WHERE clause.

A user-defined function, by contrast, is usually called within an SQL statement (SELECT, SET clause, etc.) and is expected to return a value or a table that the rest of the query can use. Because of this, UDFs in SQL are subject to constraints that stored procedures are not. For example, a stored procedure can freely modify database state and commit transactions; a UDF (in many SQL implementations) cannot commit changes and is often disallowed from making any persistent changes at all. This functional purity allows UDFs to be embedded in queries without causing side effects mid-query.

Another difference is in how they are used for modularizing logic. Stored procedures are often used for action-oriented tasks (perform a business operation, for example, “transfer money between accounts”), whereas UDFs are for computational tasks (compute and return a result, for example, “convert currency for this value” or “give me the distance between two points”). 

In some database systems, stored procedures can call UDFs and vice versa (with some restrictions). But typically, if you need a piece of logic that returns a result and fits into a larger query, you’d use a UDF. If you need to carry out a multi-step process with decision logic and intermediate steps, potentially changing data, a stored procedure is the way to go.

In modern practice, the line can blur a bit. Some databases allow table-valued functions that act somewhat like stored procedures that return a table. And some procedural extensions (like PostgreSQL’s PL/pgSQL or T-SQL) allow writing either functions or procedures with similar syntax. Still, the key distinction holds: stored procedures are executed as standalone calls and can have side effects, while UDFs are invoked within queries and are expected to behave like expressions. Aerospike’s documentation makes an interesting point in comparing them: in their system, a stream UDF (which processes many records) can resemble a stored procedure in work, but it’s still invoked and managed differently, and is more limited in scope than a general stored procedure.

When designing your database logic, you might use a mix of both. UDFs might handle piecewise calculations and data transformations, and stored procedures might orchestrate higher-level operations using those UDFs. Understanding the difference ensures you choose the right tool for the job and abide by the constraints of each.

Final thoughts on user-defined functions

UDFs can empower developers and analysts to tailor systems to their needs. From enhancing SQL with custom logic to enabling in-database analytics in NoSQL stores, to extending big data processing, UDFs show up everywhere data lives. They encapsulate the principle of moving computation closer to the data, which is essential to performance at scale. By using UDFs, organizations can achieve results that would be difficult or inefficient with out-of-the-box capabilities alone; whether that’s performing complex calculations on millions of rows or simply keeping business logic consistent across applications.

However, poorly written UDFs (or overuse of UDFs) can lead to slowdowns and difficult debugging sessions, which is why it's crucial to follow best practices: keep functions focused and efficient, avoid unnecessary side effects, and test them thoroughly. Monitor the performance of queries or jobs that use UDFs, and be ready to refactor if a particular UDF becomes a bottleneck. In performance-critical systems, consider the implementation language and method, as sometimes a compiled language UDF or a different approach might be warranted for speed.

The landscape of UDFs is also evolving. We see trends like sandboxed execution for security, UDFs written in modern languages (even the same ones used for application development, bridging the gap between app and database logic), and new execution engines (such as running UDFs in secure WebAssembly modules) to make them safer and faster. All of this indicates that UDFs will remain a relevant and growing aspect of data systems.
