
Database index selection: A comprehensive guide

Learn how to choose the right database indices to accelerate queries, cut I/O and CPU costs, and balance read speed with write overhead in any SQL environment.

May 2, 2025 | 23 min read
Alexander Patino
Solutions Content Leader

Imagine searching for a specific topic in a large book. If the book had no index, you’d flip through every page until you found what you needed, which is undoubtedly a slow process. An index at the back of the book lets you jump directly to the relevant pages. 

In the same way, a database index is a separate data structure that stores column values in an organized way (often sorted), with pointers to the full table rows. This helps the database find and retrieve data without scanning the entire table, speeding query responses.

Why are indices important? In today’s applications, databases often contain millions of rows, and users expect sub-second query responses. Without indices, the database must perform full table scans to fulfill queries, which becomes painfully slow as data grows. Indices make queries more efficient by reducing the amount of data the database must sift through. 

Proper indexing reduces I/O operations and CPU use. For instance, one project on a large dataset found that adding appropriate indices reduced disk I/O by ~30%, lowering system load and improving overall responsiveness. In another case study, optimizing indices on a slow PostgreSQL query decreased its runtime by 99.9%, turning a practically unusable query into a fast one.

However, indices are not silver bullets. They come with costs and trade-offs. Over-indexing wastes space and degrades write performance. And not every query benefits from an index. For example, if a query needs most of the table’s rows, reading the whole table may be just as fast. Effective index selection is about choosing the right indices that improve query speed while minimizing added overhead. This guide provides a comprehensive breakdown of indexing, types of indices, when to use them, how to analyze queries for indexing, strategies for different workloads, and common pitfalls to avoid in index selection.

The role of index selection in query performance

Selecting the proper indices is one of the most important performance decisions in database design. Indices determine how efficiently the database executes queries. A well-chosen index can change a query that took seconds or minutes into one that runs in milliseconds. Conversely, a missing or poorly chosen index can be the difference between a snappy application and one that crawls.

Here’s how index selection affects performance:

  • Query execution plans: When you run a SQL query, the database query optimizer decides whether to use an index. If a suitable index exists for the query’s WHERE clause or join condition, the optimizer can do an index lookup, such as an index seek/range scan, instead of scanning the whole table. This is usually much faster for selective queries (those that filter on specific conditions). Choosing the right index on the right columns is crucial so the optimizer can use it. If the wrong index is created, such as on a seldom-used column, it might never be used and provide no benefit.

  • Reducing full table scans: Indices help avoid full table scans by locating the relevant subset of rows. For example, a query like SELECT * FROM Orders WHERE customer_id = 1234 will do a quick index lookup on the customer_id index to retrieve only that customer’s orders, rather than scanning all orders. This targeted access speeds queries by orders of magnitude. It also reduces disk I/O and memory usage because the database reads fewer pages.

  • Optimizing joins and sorting: Proper index selection also improves join performance and ORDER BY queries. Indexing foreign key columns lets join operations use indices to match rows in related tables instead of nested full scans. Similarly, an index on the columns used in ORDER BY lets the database retrieve rows in sorted order directly from the index, avoiding an additional sort step, but only if the index order matches the query’s order. 

  • Composite conditions: Many queries filter on multiple columns (e.g., WHERE col1 = X AND col2 = Y). The right choice might be a composite index on (col1, col2) rather than two separate single-column indices. A composite index can satisfy both conditions in a single lookup, while multiple single-column indices might not be combined efficiently. We’ll discuss this in detail later.
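All three effects above are easy to observe directly. The following is a minimal sketch using SQLite (any SQL engine behaves similarly); the Orders table follows the article’s example, but the other column, index, and data values are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, customer_id INTEGER,"
            " order_date TEXT, total REAL)")
con.executemany(
    "INSERT INTO Orders (customer_id, order_date, total) VALUES (?, ?, ?)",
    [(i % 500, f"2025-{i % 12 + 1:02d}-01", i * 1.5) for i in range(5000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports how SQLite intends to access each table.
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

q = "SELECT * FROM Orders WHERE customer_id = 123"
before = plan(q)   # full table scan, e.g. ['SCAN Orders']

con.execute("CREATE INDEX idx_orders_customer ON Orders (customer_id)")
after = plan(q)    # index seek, e.g. ['SEARCH Orders USING INDEX idx_orders_customer ...']

# Composite conditions: one index on (customer_id, order_date) satisfies
# both predicates in a single lookup.
con.execute("CREATE INDEX idx_orders_cust_date ON Orders (customer_id, order_date)")
composite = plan("SELECT * FROM Orders"
                 " WHERE customer_id = 123 AND order_date = '2025-04-01'")

# Sorting: without an index on order_date, the plan includes an explicit
# sort step ('USE TEMP B-TREE FOR ORDER BY'); with one, rows come back
# already in order.
sort_before = plan("SELECT * FROM Orders ORDER BY order_date")
con.execute("CREATE INDEX idx_orders_date ON Orders (order_date)")
sort_after = plan("SELECT * FROM Orders ORDER BY order_date")

print(before, after, composite, sort_before, sort_after, sep="\n")
```

The same experiment works in MySQL or PostgreSQL with EXPLAIN; only the plan wording changes.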


Types of indices and when to use them

Not all indices are alike. Databases provide several index types, each with different structures and appropriate use cases. Here are the common index types and their characteristics:

1. B-Tree indices (Balanced tree indices): This is the default index type in most relational databases (e.g., MySQL InnoDB, PostgreSQL, Oracle, SQL Server). A B-tree index maintains its entries in sorted order, arranged in a balanced tree structure for fast lookup. Properties of B-tree indices include logarithmic search time and support for range queries and sorting. A B-tree can be used for equality (=) and inequality comparisons (>, <, >=, <=, BETWEEN) and for prefix matching on strings, such as LIKE 'abc%'. Because of their sorted nature, B-tree indices are flexible; the database can traverse the tree to find a starting point and then scan in order for a range of values. They are effective for most queries involving comparisons on data types like numbers, text, and dates, and are generally the go-to index type for OLTP systems. If you create a basic CREATE INDEX without specifying a type, you’re typically getting a B-tree.
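Because a B-tree keeps keys sorted, the planner can seek to a range’s lower bound and scan forward. A small sketch in SQLite (whose default indices are B-trees; table and index names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, ts INTEGER, val REAL)")
con.executemany("INSERT INTO readings (ts, val) VALUES (?, ?)",
                [(i, i * 0.1) for i in range(10000)])
con.execute("CREATE INDEX idx_readings_ts ON readings (ts)")

# The B-tree seeks to ts=100, then scans in key order to ts=200,
# instead of reading all 10,000 rows.
detail = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE ts BETWEEN 100 AND 200"
).fetchall()[0][3]
matched = con.execute(
    "SELECT COUNT(*) FROM readings WHERE ts BETWEEN 100 AND 200").fetchone()[0]
print(detail)   # e.g. SEARCH readings USING INDEX idx_readings_ts (ts>? AND ts<?)
print(matched)  # 101 rows, inclusive of both bounds
```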

2. Hash indices: A hash index uses a hash table internally. Instead of maintaining sorted order, it hashes the index key and points to locations of matching values. The result is fast equality lookups (O(1) on average). However, hash indices support only equality (=) comparisons; they cannot be used for ranges or ordering. For example, a hash index on name can speed WHERE name = 'Alice', but not WHERE name > 'Alice' or LIKE 'A%'. Hash indices also typically require the entire key to look up, with no partial prefix matches.
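The equality-only limitation follows directly from the structure. A toy contrast in plain Python, where a dict stands in for a hash index and a sorted list of (key, rowid) pairs stands in for a B-tree (names and data are illustrative):

```python
import bisect

# 'names' plays the role of the indexed column; row ids are list positions.
names = ["Alice", "Carol", "Bob", "Alice", "Dave"]

# Hash index: key -> row ids. O(1) average equality lookups, but the hash
# destroys ordering, so ranges would require scanning every bucket.
hash_index = {}
for rowid, name in enumerate(names):
    hash_index.setdefault(name, []).append(rowid)
print(hash_index["Alice"])  # [0, 3]

# Sorted ("B-tree-like") index: a range scan is a binary search to the
# boundary plus a forward walk in key order.
sorted_index = sorted((name, rowid) for rowid, name in enumerate(names))
lo = bisect.bisect_right(sorted_index, ("Alice", float("inf")))
greater = [rowid for _, rowid in sorted_index[lo:]]
print(greater)  # row ids with name > 'Alice'
```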

3. Bitmap indices: A bitmap index is an indexing strategy often used in data warehousing/OLAP contexts (commonly in Oracle and some other systems). It’s different in structure because it uses bit arrays (bitmaps) to represent the existence of a value for a row. For example, for a column with values {'Y','N'}, a bitmap index might maintain one bit array for 'Y' and one for 'N', with each bit position corresponding to a row. Bitmap indices are efficient for columns with low cardinality (few distinct values) and for combining multiple conditions, because bitmaps for different indices can be combined using bitwise AND/OR operations to satisfy complex WHERE clauses on multiple fields. They also implicitly index NULL values (in Oracle, bitmap indices even include nulls, so queries checking for NULL can use them).
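The bitwise-combination property is the whole appeal. A toy bitmap index in Python, using integers as bit arrays (column names and values are illustrative):

```python
# Bit i set in a value's bitmap means row i holds that value.
status = ["Y", "N", "Y", "Y", "N", "Y"]
region = ["E", "E", "W", "E", "W", "W"]

def bitmap(column, value):
    bits = 0
    for i, v in enumerate(column):
        if v == value:
            bits |= 1 << i
    return bits

# WHERE status = 'Y' AND region = 'E' collapses to a single bitwise AND
# of two precomputed bitmaps -- no row-by-row evaluation.
combined = bitmap(status, "Y") & bitmap(region, "E")
matching_rows = [i for i in range(len(status)) if combined >> i & 1]
print(matching_rows)  # [0, 3]
```

This is also why bitmap indices suit low-cardinality columns: one bitmap per distinct value stays cheap only when there are few distinct values.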

4. Full-text indices: Regular indices are inefficient for searching within large text fields, such as finding words in a document or product description. Full-text indices are specialized inverted indices designed for text search. An inverted index flips the data inside out: It stores a list of all the keywords (terms) and, for each term, a list of documents or rows that contain that term. This is akin to the index of a book, which for each word lists pages where it appears. Full-text indices handle word-based queries, phrase searches, and more advanced text search features such as ranking results by relevance. Different databases implement this differently. For example, MySQL has a FULLTEXT index type (particularly effective with MyISAM/InnoDB for natural language or boolean text searches), PostgreSQL uses GIN or GiST indices under the hood for its full-text search (via the tsvector mechanism), and SQL Server has a separate Full-Text Engine.
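A minimal inverted index has the same shape a full-text engine uses under the hood: each term maps to the set of documents containing it, and a multi-word search intersects those posting lists instead of scanning the text. A sketch (documents are illustrative):

```python
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red dog",
}

# Build the inverted index: term -> set of document ids.
inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

# AND search: intersect posting lists for each query term.
hits = inverted["quick"] & inverted["dog"]
print(sorted(hits))  # [3]
```

Real full-text engines add stemming, stop words, positional data for phrases, and relevance ranking on top of this core structure.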

5. GiST indices (Generalized search tree): GiST is an extensible index structure provided by PostgreSQL that acts as a framework for building various specialized indices. It’s like a generalized form of B-tree that you can customize with rules for what it means for data to be “in order” or “overlap.” GiST indices are flexible and support data types that don’t have a natural linear order. The most common use of GiST is for geospatial data (GIS) and geometric shapes; for example, finding whether two polygons intersect, or finding all points within a region. PostGIS (the geospatial extension for PostgreSQL) uses GiST indices for spatial objects. GiST can also be used for full-text search with the “text search GiST” or for JSON containment queries, though GIN is more common for text (see below). One benefit of GiST is that it can handle overlap queries, such as “does this interval overlap with that interval” or “is this point inside this bounding box,” which a plain B-tree cannot do efficiently.

6. GIN indices (Generalized inverted index): GIN is another PostgreSQL-specific index type, intended for cases where each row’s column value contains multiple components that need indexing. Think of it as an inverted index for composite values. The canonical example is full-text search: A document contains many words, so a GIN index on a text vector indexes each word and points to the rows that have that word. Similarly, for columns that are arrays or JSON objects, a GIN index can index each element or key for lookup. With a GIN on a JSONB column, you can query for rows that contain a certain key or value. GIN indices are powerful for containment searches such as “does this array contain X?” and “does this document contain word Y?”.

7. Other specialized indices: Some databases have additional index types or variations worth mentioning:

  • Clustered vs. non-clustered: This isn’t a separate data structure, but rather how the index relates to the table data. A clustered index means that the table’s rows are stored on disk in the order of the index key. In other words, the index dictates the physical order of rows. For example, in SQL Server, a table can have at most one clustered index, often on the primary key, and the table is organized by that index. In MySQL InnoDB, the primary key is always the clustered index, so InnoDB tables are physically ordered by the primary key. Clustered indices make range queries on the clustering key fast because the rows are stored contiguously in that order, and lookups on the clustering key avoid an extra step because the row data lives in the index itself. A non-clustered index is a typical secondary index that is stored separately from the table and contains pointers, such as row IDs or primary keys, to the actual table rows. Non-clustered indices don’t affect row storage order, and you can have many of them on a table. Clustered indices are best suited to columns that determine how you frequently access data in ranges, such as clustering on a date for a log table to get all entries in date order. Every table should generally have a clustered index if the engine supports it; without one, the table is stored as a heap. Non-clustered indices are the standard indices you add to support various queries. Most engines cluster on the primary key automatically, so index selection typically refers to the additional non-clustered indices you create.

  • Unique indices: Any index can be declared UNIQUE, which means it enforces uniqueness of the indexed key across the table. Unique indices help preserve data integrity by allowing no duplicate values in the indexed column or column combination, and the optimizer also knows that at most one row will match for a given key, which can slightly improve plan choices. Use unique indices to enforce natural keys or ensure no duplicates, such as an email address field that must be unique. If a query always expects a single row, such as looking up by a unique username, a unique index is appropriate.

  • Partial (filtered) indices: This is an index built on a subset of rows that satisfy a condition. We cover this in detail in a later section, but in short, use partial indices to index only the portion of data you care about, such as only the active records, to save space and maintenance, especially when most of the table is not queried often.

  • Function-based or expression indices: Many databases allow indexing based on the result of an expression or function, not just raw columns, such as an index on LOWER(customer_name) to make case-insensitive search easier, or an index on a computed column. If queries apply a function to a column, such as WHERE LOWER(name) = 'john', a normal index on name won’t be used, because of the function. An index on the expression solves this. Use them to support query patterns that involve calculations or transformations.

  • BRIN (Block range index): A PostgreSQL-specific index type that is useful for huge tables where data is naturally clustered on disk, like time series. BRIN indices store summaries of values for block ranges, such as min and max of a block of 1000 rows. They are small and cheap to maintain. On large append-only tables where data has some natural order, such as timestamps, a BRIN skips blocks that don’t match a range filter, providing a lightweight alternative to a full index. For example, a BRIN on a date column in a billion-row log table lets a query on a particular month skip over blocks that contain other months. BRIN indices have low precision but low cost.
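Two of the variations above, partial and expression indices, are easy to try in SQLite, which supports both. A sketch (table, column, and index names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT,"
            " due TEXT, owner TEXT)")
con.executemany(
    "INSERT INTO tasks (status, due, owner) VALUES (?, ?, ?)",
    [("active" if i % 100 == 0 else "done", f"2025-{i % 12 + 1:02d}-15",
      f"User{i % 50}") for i in range(5000)])

# Partial index: only the few 'active' rows are indexed, so the index
# stays small even though completed tasks dominate the table.
con.execute("CREATE INDEX idx_tasks_active_due ON tasks (due)"
            " WHERE status = 'active'")
partial_plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tasks"
    " WHERE status = 'active' AND due < '2025-06-01'").fetchall()[0][3]

# Expression index: WHERE LOWER(owner) = ... becomes an index lookup
# instead of evaluating LOWER() against every row.
con.execute("CREATE INDEX idx_tasks_lower_owner ON tasks (LOWER(owner))")
expr_plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tasks"
    " WHERE LOWER(owner) = 'user7'").fetchall()[0][3]

print(partial_plan)  # uses idx_tasks_active_due
print(expr_plan)     # uses idx_tasks_lower_owner
```

PostgreSQL and SQL Server support the same two features with nearly identical CREATE INDEX syntax (SQL Server calls partial indices “filtered indexes”).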


Trade-offs and costs of indexing

While indices make reads and queries faster, they come with trade-offs. It’s important to understand these costs to make better decisions about index selection.

  • Write performance penalty: Every additional index on a table means extra work when the data changes. For an INSERT, the database must not only insert the row into the table but also insert a corresponding entry into each index. For an UPDATE, if the indexed column is changed (or in the case of a clustered index, any change to the row can mean a physical relocation), the index must be updated. Similarly, deletes remove entries from indices. This overhead can slow down write-heavy workloads. For example, if a table has 10 indices and you insert a row, that’s 10 index insertions. If it had only two indices, that would be just two index insertions. Thus, there’s a direct trade-off: faster reads at the expense of slower writes. In high-throughput OLTP systems, too many indices can become a bottleneck on insert/update performance. A general rule is to index just enough to satisfy queries without slowing down writes too much. If you notice that insert or update latencies are high, consider whether all your indices are necessary.
     

  • Storage overhead: Indices consume disk space, as well as memory when cached. An index is essentially a copy of certain columns, plus pointers, in a sorted structure. The more and larger the indices, the more storage they take up. If you have a big table and create several multi-column indices, you could increase the storage footprint several times over. This also affects backup size and replication. Each index might be anywhere from a few percent of the table size to even larger than the table, if indexing wide text columns, for instance. Also, if the database has limited memory, heavy indexing can pressure the buffer pool because index pages need caching along with table pages.
     

  • Maintenance and fragmentation: Indices can become fragmented over time, especially B-tree indices with lots of random inserts/deletes. Fragmentation means the logical order of pages no longer matches the physical order well, leading to less efficient range scans. Some databases require or benefit from periodic index rebuilds or reorganization to maintain performance (e.g., in SQL Server you might rebuild or reorganize indices when fragmentation exceeds a certain percentage; Oracle and PostgreSQL are generally more self-managing, but fragmentation can still occur). Index statistics also need to be kept current; if data distribution changes and stats are stale, the optimizer might make bad decisions, such as using an index when a table scan would be faster. So you may need to schedule ANALYZE or index stats refreshes.
     

  • Locks/contention on writes: In some systems, index updates can cause extra locking, which can increase contention in a write-heavy environment. For instance, in MySQL with InnoDB, inserting into a table with multiple unique indices requires locking those index entries to check for uniqueness, which can cause contention under high concurrency. In Oracle, the noted issue is with bitmap indices on OLTP – they essentially lock a range of rows due to the nature of bitmaps, causing contention.
     

  • Diminishing returns and query planning overhead: If you have many indices, each time a query comes, the optimizer has more choices to consider. There is a slight CPU overhead in planning when many indices exist, because the optimizer must evaluate which index is best, or whether to use an index at all. However, this is usually minor unless you have many indices or complex plans. Also, having multiple indices could sometimes lead the optimizer to pick a suboptimal one if stats mislead it, whereas a leaner indexing might avoid that scenario. There’s also the concept of “diminishing returns”: The first few indices on a table address the major query patterns. Adding more yields less benefit. Eventually, they might not even be used, but would still cost you on writes and storage, with no benefit.

    For example, suppose you have a log table and you add an index, just in case, on a column that is rarely queried. You’ve now slowed down every insert into the log due to the extra index maintenance and used extra space, for a benefit that might help only 1% of queries, if any. In such a case, it might have been better not to index and deal with a slower ad-hoc query occasionally, rather than penalize every insert. This is the kind of judgment call database administrators make frequently.
     

  • Over-indexing pitfalls: Having too many indices or indexing too many columns leads to wasted space and maintenance overhead. Review indices periodically and remove those that are not providing enough benefit. Many databases let you find unused indices (e.g., Oracle has index usage monitoring, PostgreSQL’s pg_stat_user_indexes shows the number of scans per index, and SQL Server’s DMVs can list indices that haven’t been used since the last restart). Dropping unused indices reclaims performance and space.
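The write-penalty trade-off at the top of this list is easy to measure. A minimal sketch using SQLite: load identical rows into an unindexed table and into one carrying four secondary indices, and compare timings (table and index names are arbitrary; exact numbers vary by machine):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE plain (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
con.execute("CREATE TABLE indexed (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
# Give 'indexed' several secondary indices; every insert must update them all.
for col in ("a", "b", "c", "d"):
    con.execute(f"CREATE INDEX idx_indexed_{col} ON indexed ({col})")

rows = [(i, i * 2, i * 3, i * 4) for i in range(20_000)]

def timed_insert(table):
    start = time.perf_counter()
    with con:  # one transaction per load
        con.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)
    return time.perf_counter() - start

t_plain = timed_insert("plain")
t_indexed = timed_insert("indexed")
print(f"no indices: {t_plain:.3f}s, four indices: {t_indexed:.3f}s")
# Loading the indexed table is usually noticeably slower: the same
# trade-off is paid on every INSERT/UPDATE/DELETE in production.
```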

Tools and automation for index selection

Today’s databases and third-party tools provide several ways to assist with index selection and tuning. These can help recommend indices, find unused ones, or manage indices. Here are some notable tools and features across popular systems:

  • Aerospike:

    • Choosing which secondary indices you create: Because Aerospike is built for primary-key lookups, every record is already reachable through an in-memory (or on-flash) primary index. So if your application needs to search on anything other than the primary key, such as a user-ID bin, a timestamp range, or a value inside a list/map, you create a secondary index on that bin (or CDT path). Each secondary index adds memory (or flash) overhead and some write latency, so “index selection” means deciding to index only the bins that your queries need. Best-practice guides advise creating an index when you perform frequent non-primary key lookups and skipping it when you do one-off scans.

    • Selecting the storage location for the primary index: In Enterprise Edition, you can set index-type to shmem (keep the primary index in RAM) or flash (store it on NVMe SSD). For very large datasets, putting the primary index on flash cuts RAM requirements by an order of magnitude while still delivering sub-millisecond reads; for smaller, latency-critical workloads, you keep it in memory. That configuration choice is also a form of “index selection” because it affects cost vs. speed tradeoffs. As an example, 4 billion objects with RF=2 need about 477 GB of RAM for an in-memory index, but only about 2.5 GB if the index sits on SSD.

    • Things that aren’t really an issue: Aerospike’s query engine doesn’t do sophisticated “cost-based” index picking the way a relational database management system does; a query contains at most one filter that maps directly to one secondary index, so the application controls which index gets used. Also, there’s no need to worry about composite B-tree vs. bitmap vs. hash planners; the primary index is always a distributed hash table, and secondary indices are simple inverted indices under the hood. 

    • When you do need to think about index selection:

      • Design time: decide which bins (or list/map keys) warrant a secondary index so your queries stay fast without wasting memory

      • Cluster-config time: pick index-type flash or shmem based on dataset size and latency budget

  • MySQL:

    • EXPLAIN and performance schema: MySQL’s first line of index tuning is using EXPLAIN to see how queries use indices. In addition, the Performance Schema (and the sys schema views) help identify which indices are not being used. For example, MySQL’s sys.schema_unused_indexes view (if enabled) lists indices that haven’t been used since the server started. This can guide you on which indices might be safe to drop.
       

    • MySQL Workbench index advisor: MySQL Workbench has a “Performance Reports” section with advisors that may flag items such as missing or redundant indices. It’s not advanced, but it can catch obvious issues. For example, table scans on large tables might prompt “consider adding an index on column X” if the query is known.
       

    • Third-party tools: Percona Toolkit offers scripts such as pt-query-digest to find slow queries and pt-index-usage, which can match queries against indices to see which indices are never used by any query in a log. Also, services such as EverSQL, an online advisor, analyze a query and suggest indices or rewrites. 

    • MySQL Enterprise Monitor: If you have access, it includes advisors that suggest adding indices for certain query patterns and also suggests removing duplicates.

  • PostgreSQL

    • Auto Explain and pg_stat_statements: EXPLAIN ANALYZE and the pg_stat_statements module help identify slow queries and show whether they use indices, though PostgreSQL won’t specifically recommend creating an index.
       

    • Hypothetical indices: The HypoPG (Hypothetical indices) extension lets you simulate an index’s presence and see whether the planner would use it, without actually creating it on disk. This is helpful for testing potential indices in production systems with minimal overhead.
       

    • Index advisor extension: The community has developed the index_advisor extension, which is used by tools like Supabase’s studio. This extension takes a given query and outputs recommended index statements. Essentially, it analyzes the query predicates and suggests an index that would reduce the cost.
       

    • pganalyze and other tools: pganalyze, a SaaS tool for Postgres tuning, has an Index Advisor feature that suggests indices to create based on a workload or a specific query. It uses knowledge of the Postgres planner to simulate and compare costs.

    • Automatic indexing research: PostgreSQL doesn’t yet have built-in autonomous indexing, but tools are being developed, such as prototyped background services that create indices based on usage patterns. For now, database administrators (DBAs) use the above methods.

  • SQL Server: 

    • Database Engine Tuning Advisor (DTA): This is a well-established tool that comes with SQL Server and also works with Azure SQL. With a trace or a set of queries, it will recommend indices, indexed views, and partitioning to improve performance. It provides a report of what to add and the estimated improvement. DBAs often use it as a starting point, though one must be cautious and not blindly create everything it suggests, because it may suggest many indices that could hurt write performance.
       

    • Missing index DMVs: SQL Server’s execution engine keeps track of “missing index” opportunities. When it compiles a query plan and determines that an index that doesn’t exist would have helped, it logs that in DMVs (Dynamic Management Views). Specifically, sys.dm_db_missing_index_details and related DMVs can be queried to find which indices the engine believes would be helpful, along with an “impact” estimate based on past query executions. For example, if a particular table saw a lot of scans, SQL Server could estimate that an index on (columnA, columnB) would have reduced cost by X%. Many graphical tools report this information, and the graphical execution plan in Management Studio often shows missing-index recommendations in green text.
       

    • Automatic tuning: Azure SQL Database and Azure Managed Instance have an automatic indexing feature that creates indices it deems beneficial and drops those that are unused or degrade performance. It monitors query performance before and after to validate the benefit. This is essentially Microsoft’s cloud implementation of set-and-forget index tuning.

    • Third-party and monitoring: Tools such as SentryOne or SolarWinds have index analysis features; also, Microsoft’s SQL Server Management Studio has a Database Tuning Advisor (basically DTA in GUI form).

  • Oracle

    • SQL Tuning Advisor: Oracle has a tuning advisor as part of Oracle Enterprise Manager and the Tuning Pack. When you run it on a SQL statement or workload, it recommends indices along with other changes, such as SQL profiles and query rewrites.

    • Oracle automatic indexing: Oracle 19c introduced Automatic Indexing. When enabled, the database internally monitors running SQL, identifies candidates for new indices, creates them as invisible indices, tests them on the workload, and, if they consistently improve performance, marks them visible for general use. It can also drop or make invisible any auto-index that hasn’t been used for a long time (the default retention is 373 days). Essentially, Oracle acts as its own DBA: every 15 minutes, it runs a task to evaluate and implement indexing changes. This feature, when licensed and enabled, saves time for DBAs in Oracle OLTP environments.
       

    • Oracle advisory views: Oracle also has views such as v$object_usage for indices to see whether they’ve been used and procedures to detect useless indices.
       

  • Other databases/tools:

    • Some NoSQL/NewSQL systems don’t have these tools because indexing is simpler or manual. But some newer cloud databases, such as AWS Aurora, are exploring automation.
       

    • Open-source tools such as PgHero (for PostgreSQL) and Percona’s pt-duplicate-key-checker indirectly help by surfacing index usage stats and duplicate indices.
       

    • Visualization tools: Sometimes, just visualizing the schema and indices helps. Tools such as ERWin or database diagrams can highlight indices, but that requires more design time.
       

    • Community scripts: For example, a popular script for SQL Server is sp_BlitzIndex. It analyzes indices and reports potential issues such as missing indices, duplicate indices, and unused indices.

In practice, a sensible approach is to use native advisors and information to get hints, but apply human judgment. For example, DTA might suggest 10 indices that each help one query by a tiny margin, but adding all 10 might degrade insert performance. Choose the top two or three that provide the most benefit.
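Circling back to the Aerospike sizing example earlier (4 billion objects at replication factor 2): the in-memory figure is easy to sanity-check with back-of-the-envelope arithmetic, assuming the roughly 64 bytes of primary-index space per record copy that Aerospike documents:

```python
# Rough primary-index RAM estimate for an in-memory Aerospike index.
# 64 bytes per index entry is the assumed per-record-copy cost.
objects = 4_000_000_000
replication_factor = 2
bytes_per_entry = 64

index_bytes = objects * replication_factor * bytes_per_entry
print(f"{index_bytes / 2**30:.0f} GiB")  # ≈ 477 GiB, matching the figure above
```

The flash figure (~2.5 GB of RAM) is much smaller because only small per-index metadata stays in memory while the index entries themselves live on NVMe.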

Aerospike takes index selection to the next level

Choosing the right indices can make the difference between sluggish queries and lightning-fast applications, but the database engine itself also matters. Aerospike’s real-time data platform is designed around an in-memory primary key index combined with a distributed tree structure, resulting in O(1) look-ups and eliminating costly index rebuilds even at petabyte scale.

Beyond the primary index, Aerospike supports secondary indices and composite indices that let you satisfy complex filters and range queries without full scans. Its smart-client architecture keeps index lookups sub-millisecond across a global cluster. Because indices are stored in shared memory and data can reside on flash or SSD, you get high performance with fewer servers and lower total cost of ownership, freeing you to focus on picking the right columns and query patterns rather than fighting hardware limits.

If you’re ready to see how an architecture purpose-built for speed and scale changes the index-selection game, explore Aerospike hands-on or dive deeper with the detailed architecture whitepaper.

Try Aerospike: Community or Enterprise Edition

Aerospike offers two editions to fit your needs:

Community Edition (CE)

  • A free, open-source version of Aerospike Server with the same high-performance core and developer API as our Enterprise Edition. No sign-up required.

Enterprise & Standard Editions

  • Advanced features, security, and enterprise-grade support for mission-critical applications. Available as a package for various Linux distributions. Registration required.