
What is an identity graph?

Discover how identity graphs link emails, device IDs, and more into one profile, drive real-time personalization, and how Aerospike scales to petabytes.

January 30, 2025 | 16 min read
Alexander Patino
Content Marketing Manager

An identity graph is a structured database that links the many identifiers a single person or household generates, such as email addresses, device IDs, loyalty numbers, cookies, and phone numbers, into one persistent profile. Marketers and business strategists use the resulting profile to recognize the same consumer across websites, mobile apps, connected TV, in-store systems, and call centers. By stitching together these diverse data points, an identity graph allows a brand to treat “jane_doe@email.com” on a laptop, “A1B2C3” on a smartphone, and a credit-card token in a point-of-sale system as one customer rather than three unrelated records.

The underlying architecture ingests first-party customer data from customer relationship management systems (CRM), customer data platforms (CDP), and data warehouses, as well as second- and third-party feeds such as partner tags or publisher IDs. Each new record contributes additional identifiers or behavioral data. The system continuously calculates links, some confirmed while others are inferred with statistical confidence, and stores the resulting clusters in a graph structure. Marketers query this graph to suppress existing customers from acquisition campaigns, build look-alike audiences, or sequence messaging based on prior interactions.

Because the graph operates in near real time, a site visit with a new browser cookie can trigger an immediate lookup: If the cookie maps to the same hashed email captured at checkout yesterday, the visitor sees a loyalty offer instead of a generic ad. This identity resolution process reduces media waste, improves measurement, and supports personalization without exposing raw personally identifiable information.

Identity resolution

Identity resolution turns raw identifiers into usable customer profiles. Each incoming record passes through a normalization layer that standardizes formats (such as lower-casing email addresses and stripping special characters from phone numbers). Next, matching logic evaluates whether the new identifier already exists in the graph. A confirmed match merges the data; a non-match creates a new node. Over time, the system revisits older nodes, merging or splitting them as additional data clarifies relationships.
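The normalization step can be sketched in a few lines of Python. This is a minimal illustration, not a production normalizer: the function names are hypothetical, and real pipelines usually add country-code handling for phone numbers and provider-specific email rules.

```python
import hashlib
import re


def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lower-case an email address."""
    return raw.strip().lower()


def normalize_phone(raw: str) -> str:
    """Keep digits only, dropping spaces, dashes, parentheses, and '+'."""
    return re.sub(r"\D", "", raw)


def hash_identifier(value: str) -> str:
    """Return the SHA-256 hex digest of an already-normalized identifier."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Two differently formatted inputs collapse to the same lookup key.
key_a = hash_identifier(normalize_email("  Jane_Doe@Email.com "))
key_b = hash_identifier(normalize_email("jane_doe@email.com"))
```

Because both inputs normalize to the same string before hashing, `key_a` and `key_b` are identical, which is what lets the matching logic treat them as one identifier.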

Effective identity resolution balances three goals: accuracy (linking only the correct identifiers), reach (covering as many customer touchpoints as possible), and privacy (not exposing sensitive data or violating consent terms). Brands often deploy hashing, tokenization, or clean-room environments to keep raw customer data inaccessible while still allowing deterministic matching behind the scenes.


Probabilistic vs. deterministic matching

Probabilistic matching uses statistical models to decide whether two identifiers belong to the same person. For example, a mobile ad ID and a browser cookie that share IP address ranges, location patterns, and similar browsing times may be considered a match with greater than 90% probability. This technique increases reach when direct identifiers are unavailable, though it introduces the risk of false positives.

Deterministic matching relies on explicit, unique identifiers such as hashed email addresses, login names, or loyalty numbers. When two records carry the same identifier, the match is asserted with near-perfect certainty. Deterministic links are more accurate and are generally preferred for sensitive use cases such as attribution or customer support, but they demand strong first-party data and user authentication events.

Most enterprise-grade identity graphs combine both approaches. A deterministic spine forms the foundation; probabilistic logic fills the gaps where users remain anonymous. Modern platforms also assign confidence scores to every edge in the graph, letting marketers decide whether to restrict an activation to deterministic links only or include high-probability probabilistic connections.
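As a rough sketch of how confidence scores on edges let an activation choose its own strictness, the snippet below filters a list of edges. The `Edge` fields and the `select_edges` function are assumptions made for illustration, not a specific platform's schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    match_type: str    # "deterministic" or "probabilistic"
    confidence: float  # 0.0 to 1.0


def select_edges(edges: List[Edge], min_confidence: Optional[float] = None) -> List[Edge]:
    """Always keep deterministic edges.

    Probabilistic edges are kept only when min_confidence is given and the
    edge's confidence is at or above it. With min_confidence=None the result
    is deterministic-only.
    """
    selected = []
    for edge in edges:
        if edge.match_type == "deterministic":
            selected.append(edge)
        elif min_confidence is not None and edge.confidence >= min_confidence:
            selected.append(edge)
    return selected


edges = [
    Edge("email:h1", "cookie:c1", "deterministic", 1.0),
    Edge("cookie:c1", "device:d9", "probabilistic", 0.92),
    Edge("device:d9", "cookie:c7", "probabilistic", 0.55),
]
```

Calling `select_edges(edges)` returns only the deterministic spine, while `select_edges(edges, 0.9)` also admits the 0.92 link and still excludes the 0.55 one.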

Maintaining an identity graph is an ongoing process. Data governance teams audit match rules regularly, adjust probabilistic thresholds as privacy regulations evolve, and purge stale identifiers to prevent drift. When executed responsibly, the graph serves as the connective tissue across MarTech and AdTech stacks, providing consistent customer experiences while complying with privacy regulations such as GDPR and CCPA.


How does an identity graph work?

So those are the basics. Let’s look at the workflow in more detail. 

The workflow can be broken into four stages:

  1. Data onboarding
    • Raw records arrive with dozens of identifiers: email addresses, device IDs, cookies, loyalty numbers, hashed credit-card tokens, IP addresses, mailing addresses, and more.
    • The graph normalizes formats, removes obvious duplicates, and assigns each record a timestamp to support recency scoring and decay logic.

  2. Identifier stitching
    • All identifiers are compared to existing nodes in the graph.
    • Deterministic matching connects records when two identifiers are confirmed to belong to the same entity. For example, the same hashed email address may appear in both a CRM entry and a website login.
    • Probabilistic matching uses statistical models to link records that are highly likely to belong together when deterministic evidence is missing. For example, the same device ID, a shared IP, and overlapping behavioral patterns may deliver a 92% confidence score.

  3. Graph maintenance
    • Each newly resolved profile is assigned a persistent internal ID.
    • Continuous feedback loops update attributes such as last-seen device or preferred channel.
    • Governance rules enforce data-retention limits and respect opt-out requests to follow privacy regulations.

  4. Activation and measurement
    • Authorized applications request segments or single profiles through an API or CDP connector.
    • For privacy, the system returns the least amount of data possible for that request. For example, an ad network receives device IDs, while an email platform receives hashed email addresses.
    • Marketers run cross-channel personalization, look-alike modeling, or multi-touch attribution without re-collecting data.
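The stitching stage is often described as finding connected components over identifiers. A minimal sketch using a union-find (disjoint-set) structure follows; the class and method names are hypothetical, and this is one common technique rather than the method any particular product uses.

```python
class IdentityClusters:
    """Union-find over identifier strings: linked identifiers share one cluster."""

    def __init__(self):
        self._parent = {}

    def _find(self, x):
        # Register unseen identifiers as their own single-member cluster.
        self._parent.setdefault(x, x)
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self._parent[x] != root:
            self._parent[x], x = root, self._parent[x]
        return root

    def link(self, a, b):
        """Record that identifiers a and b belong to the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, a, b):
        return self._find(a) == self._find(b)


clusters = IdentityClusters()
clusters.link("email:hash_1", "cookie:c1")   # e.g. a website login
clusters.link("cookie:c1", "device:A1B2C3")  # e.g. an app login with the same account
```

After these two links, `email:hash_1` and `device:A1B2C3` resolve to the same cluster even though no single record contained both. Note that the cluster root can change when clusters merge, so a real system would map clusters to a separately stored persistent ID, and would also need a way to split clusters, which plain union-find does not support.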

Key mechanics and considerations

Several trade-offs determine which approach fits a particular use case:

Precision versus scale: Deterministic matching offers higher accuracy but smaller reach; probabilistic matching gets you more matches but could result in more false positives. Many organizations adopt a hybrid approach with tiered confidence levels.

Real-time versus batch: Streaming identity resolution supports onsite personalization within milliseconds, while nightly batch jobs suffice for reporting and offline campaigns.

Privacy controls: Replace personal identifiers with pseudonyms, add noise to the data so individual users can’t be reverse-engineered, and maintain regional data stores to honor data-sovereignty laws.

Conflict handling: When two records contain contradictory demographic attributes, such as different birth years, resolve the conflict by prioritizing one data source over another, or by using machine learning weighting. Either way, make sure the decision is logged in case it needs to be reviewed or corrected later.
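A source-priority rule with an audit log might look like the sketch below. The source names and their ordering in `SOURCE_PRIORITY` are assumptions for the example; a real deployment would define its own ranking or replace it with learned weights.

```python
# Hypothetical ranking: lower number wins. Unknown sources rank last.
SOURCE_PRIORITY = {"crm": 0, "loyalty": 1, "web_form": 2}


def resolve_conflict(attribute, candidates, audit_log):
    """Pick the value from the highest-priority source and log the decision.

    candidates is a list of (source, value) pairs for one attribute.
    """
    ranked = sorted(
        candidates,
        key=lambda c: SOURCE_PRIORITY.get(c[0], len(SOURCE_PRIORITY)),
    )
    chosen_source, chosen_value = ranked[0]
    audit_log.append(
        {
            "attribute": attribute,
            "chosen": chosen_value,
            "chosen_source": chosen_source,
            "rejected": ranked[1:],
        }
    )
    return chosen_value


audit_log = []
birth_year = resolve_conflict(
    "birth_year", [("web_form", 1990), ("crm", 1989)], audit_log
)
```

Here the CRM value wins, and the log entry keeps the rejected web-form value so the decision can be reviewed or reversed later.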

Points of debate

Vendors in this field report accuracy metrics such as match rates (how often two records are correctly matched), false-merge rates (how often two records are incorrectly matched), and split rates (how often a single entity is incorrectly identified as multiple ones). However, these terms lack industry-wide standard definitions, making it difficult to compare vendors.

Privacy is also contested. Some privacy advocates argue that probabilistic matching can re-identify users even after companies have removed or masked data to hide their identities. Some advertisers counter that it is necessary because some browsers now restrict third-party cookies. Regulators in the EU and several U.S. states are evaluating whether certain probabilistic techniques constitute profiling under data-protection law.

Outputs

A healthy identity graph delivers:

• A deduplicated customer count, which helps provide more accurate financial forecasting because businesses can better estimate customer value and spending.

• Addressable audiences across email, paid media, and onsite channels, which improves marketing efficiency by reaching the right user through the right channel.

• Cross-device frequency capping to limit the number of times a user sees the same ad across different devices, which reduces ad fatigue and waste.

• Cohort-level insights that pool users with shared traits, to use for product recommendations and loyalty program personalization.
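Cross-device frequency capping, one of the outputs listed above, can be illustrated by counting impressions per resolved profile instead of per device. The `FrequencyCap` class and the in-memory mapping are hypothetical; a production system would look profiles up in the graph and keep counters in a shared store.

```python
from collections import defaultdict


class FrequencyCap:
    """Count ad impressions per profile so the cap holds across devices."""

    def __init__(self, device_to_profile, max_impressions):
        self.device_to_profile = device_to_profile
        self.max_impressions = max_impressions
        self._counts = defaultdict(int)

    def should_serve(self, device_id, ad_id):
        # Devices the graph has not resolved are treated as their own profile.
        profile = self.device_to_profile.get(device_id, device_id)
        key = (profile, ad_id)
        if self._counts[key] >= self.max_impressions:
            return False
        self._counts[key] += 1
        return True


cap = FrequencyCap(
    device_to_profile={"phone-1": "profile-42", "laptop-7": "profile-42"},
    max_impressions=2,
)
decisions = [
    cap.should_serve("phone-1", "ad-A"),
    cap.should_serve("laptop-7", "ad-A"),
    cap.should_serve("phone-1", "ad-A"),
]
```

The third request is refused because the phone and laptop share one profile, so the cap of two is already reached.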


Identity graph examples

Identity graphs turn raw identifiers into actionable insights that marketers can apply across the funnel. The most common use cases fall into four categories: customer journey mapping, audience management, personalization, and data privacy compliance.

Customer journey mapping

An identity graph stitches device IDs, email addresses, loyalty numbers, and offline purchase records into one profile. With that profile, a company’s marketing staff can:

• Detect when a prospect moves from anonymous browsing to logged-in purchase, closing attribution gaps.

• Sequence messages so that touchpoints in email, mobile app, and in-store promotions build on one another instead of repeating the same offer.

• Measure incremental impact by comparing customers exposed to specific channels against matched control groups.

Audience management

Media budgets stretch further when segments are built on unified profiles rather than isolated cookies or mobile advertising IDs. Identity graphs support audience management by:

• Building high-value segments such as “in-store purchasers who have not opened the app in 30 days” and activating them across DSPs, social networks, and owned channels.

• Suppressing existing customers from acquisition campaigns to reduce paid-media waste.

• Extending reach through look-alike modeling on top of deterministic data, improving scale without sacrificing accuracy.

Personalization

Real-time profile resolution lets marketers serve contextually relevant content and offers:

Web personalization: The identity graph recognizes a returning shopper even if cookies have been cleared, allowing dynamic hero banners based on past browsing.

Email and push: Lifecycle triggers, welcome series, replenishment reminders, and loyalty tier upgrades come from one event timeline rather than separate channel silos.

Omnichannel recommendations: Product engines draw from the graph’s consolidated behavioral and transactional attributes, increasing average order value.


Data privacy compliance

Regulations such as GDPR and CCPA require accountability for every identifier. Identity graphs provide:

• Centralized consent management: preferences captured in any channel update the master profile, so downstream activations respect opt-outs.

• Audit trails that show when and where personal data entered the system, which helps satisfy data subject access requests.

• Granular data retention policies, with identifiers that can be anonymized or deleted at the profile level without breaking the links that support non-personal analytics.

Trade-offs and limitations

While use cases are compelling, two concerns regularly surface:

Accuracy vs. reach: Deterministic matches (exact email, login, or hashed phone) deliver high precision but smaller audiences. Probabilistic techniques (statistical linking of IP, device graphs, or behavioral patterns) widen reach but introduce match errors that can dilute campaign performance.

Privacy impact: Rich profiles heighten the risk of re-identification. Brands must balance personalization gains against consumer expectations and regulatory scrutiny.

Privacy-preserving techniques mitigate this risk but may lose some detail and increase costs. They include differential privacy (adding noise to data to protect individuals), on-device processing (keeping data on the user’s device rather than sending it to the cloud), and clean rooms (secure environments where parties can analyze combined data without seeing each other’s raw data).
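To make "adding noise" concrete, here is a small sketch of the Laplace mechanism applied to a segment count. It assumes a counting query, where adding or removing one person changes the result by at most 1; the function names and the choice of epsilon are illustrative only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng=None):
    """Return the count plus Laplace noise calibrated to sensitivity / epsilon."""
    rng = rng or random.Random()
    sensitivity = 1.0  # one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # seeded only to make the example repeatable
samples = [noisy_count(1000, epsilon=1.0, rng=rng) for _ in range(20000)]
average = sum(samples) / len(samples)
```

Individual releases differ from the true count, but the noise is centered on zero, so `average` stays close to 1000. A smaller epsilon means more noise and stronger protection, which is the loss of detail described above.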

Choosing the right use case

A phased rollout reduces both complexity and compliance risk. Early adopters often start with suppression and basic journey mapping because they deliver quick ROI and require limited data attributes. As governance frameworks mature, teams expand to full cross-channel personalization and advanced audience modeling. Continual measurement is essential; lift studies and hold-out tests verify that identity-based tactics outperform legacy cookie-based approaches.

Operational checklist

  • Define clear success metrics for each use case before activating data.

  • Classify identifiers (PII vs. pseudonymous) and map consent status.

  • Integrate a privacy impact assessment into every new audience or trigger.

  • Establish feedback loops, because conversion data flowing back into the graph improves future match accuracy and segment quality.

  • Keep data governance and marketing operations synchronized; misalignment causes either compliance failures or lost revenue opportunities.

By aligning identity graph capabilities with specific objectives, whether closing attribution gaps, improving media efficiency, tailoring experiences, or strengthening compliance, organizations turn fragmented data into a competitive asset while respecting consumer trust.

How to build an identity graph

Map use cases and success metrics
Begin by pinning down the decisions that the graph must support, such as cross-device attribution, fraud detection, omnichannel personalization, or a new CDP module. Document latency requirements, scale targets, service-level objectives, and compliance constraints. Without a scope, later phases such as data modeling and identity resolution rules drift, and the project needs to be reworked.

Inventory and profile data sources
List every location holding customer-relevant identifiers: CRM tables, point-of-sale logs, web analytics events, mobile SDK payloads, call-center records, loyalty programs, and advertising pixels. For each source, capture: identifier types (email, phone, cookie, device ID), cadence, volume, data quality scores, and legal basis for processing. Profiling scripts that compute null rates, pattern mismatches, and cross-field uniqueness speed later matching decisions.
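A profiling script of the kind mentioned above could start as small as this sketch. It covers a single column only (null rate, pattern mismatches, and uniqueness within that column); the email pattern is deliberately loose and is an assumption, not a full address validator.

```python
import re

# Loose shape check: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values, pattern=None):
    """Compute basic quality metrics for one identifier column."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    mismatches = [
        v for v in present if pattern is not None and not pattern.match(v)
    ]
    return {
        "null_rate": (total - len(present)) / total if total else 0.0,
        "pattern_mismatch_rate": len(mismatches) / len(present) if present else 0.0,
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
    }


report = profile_column(["a@x.com", "a@x.com", None, "bad", ""], EMAIL_PATTERN)
```

For this sample, two of five values are missing, one of the three present values fails the pattern, and two of the three present values are distinct.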

Design the data model
Create an entity-relationship diagram that places the person or household at the center. Use edge tables to record observed relationships between identifiers (for example, hash_email → customer_id, cookie_id → browser_fingerprint). Keep observation timestamps separate from assertion timestamps so queries return consistent, reproducible results. Adopt a slowly changing dimension pattern in the data warehouse to track descriptive attributes over time. At this stage, revisit privacy requirements; if regional regulations require you to collect less data, drop attributes you don’t need, or convert them into anonymized tokens.

Build ingestion pipelines
Stream or batch-load the profiled sources into a raw zone, then use schema registry tooling to detect added fields that could break processing. Apply format-specific parsers, such as JSON Flatten, Avro, and CSV, to convert feeds into columnar files for more efficient querying, and load the converted data into the warehouse’s staging schema. Add contextual information, such as the campaign name or app version, to help distinguish or link user activity.

Implement identity resolution logic
Deterministic rules rely on stable identifiers (login email, loyalty number) and are more precise. Probabilistic algorithms use similarity functions, such as Levenshtein distance, Jaro-Winkler, and IP-based co-visitation, to cluster records when deterministic links are missing. Configure a tiered approach: run deterministic joins first, then probabilistic clustering on the residual unmatched set. Store confidence scores alongside each resolved link so applications that use the data can choose their own precision–recall threshold.
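The tiered approach can be sketched as follows. This example assumes records are dictionaries with hypothetical `id`, `email_hash`, and `name` fields, and it uses a normalized Levenshtein similarity on names as the only probabilistic signal. Real systems combine several signals, since name similarity alone is weak evidence.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale edit distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def resolve(incoming, known, threshold=0.85):
    """Tier 1: exact email-hash join. Tier 2: fuzzy name match on the rest."""
    by_email = {k["email_hash"]: k for k in known if k["email_hash"]}
    links, residual = [], []
    for rec in incoming:
        hit = by_email.get(rec["email_hash"]) if rec["email_hash"] else None
        if hit:
            links.append((rec["id"], hit["id"], "deterministic", 1.0))
        else:
            residual.append(rec)
    for rec in residual:
        best = max(known, key=lambda k: similarity(rec["name"], k["name"]), default=None)
        if best is not None:
            score = similarity(rec["name"], best["name"])
            if score >= threshold:
                links.append((rec["id"], best["id"], "probabilistic", round(score, 3)))
    return links


known = [{"id": "k1", "email_hash": "h1", "name": "jane doe"}]
incoming = [
    {"id": "i1", "email_hash": "h1", "name": "j. doe"},
    {"id": "i2", "email_hash": None, "name": "jane do"},
    {"id": "i3", "email_hash": None, "name": "robert smith"},
]
links = resolve(incoming, known)
```

Record `i1` links deterministically through the shared email hash, `i2` links probabilistically with a stored score of 0.875, and `i3` stays unmatched. Each link carries its match type and confidence so downstream consumers can apply their own threshold.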

Persist the graph in the data warehouse
Choose a storage pattern that supports both graph-style traversals and large table scans. Two popular approaches:

• Adjacency list tables keyed by primary_id with an array of linked identifiers
• Columnar star schema where fact_identity_edge(table_a_id, table_b_id, match_type, confidence) records every edge

Partition by update_date, the date a record was last updated. That way, only recent or changed partitions need to be rewritten, and the warehouse can prune irrelevant partitions during queries.

Create materialized views that flatten the graph structure (which is typically normalized and complex) into simpler, denormalized tables, because they’re easier for SQL to use. At the same time, keep the original, normalized edge tables intact, because iterative machine learning workflows often require the detailed, relationship-based graph data.
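The relationship between the normalized edge table and a flattened adjacency view can be shown in miniature. The row layout below mirrors the fact_identity_edge columns named above (two IDs, match type, confidence); the `to_adjacency` helper is a hypothetical in-memory stand-in for what a materialized view would compute in the warehouse.

```python
from collections import defaultdict


def to_adjacency(edge_rows, min_confidence=0.0):
    """Flatten (a_id, b_id, match_type, confidence) rows into an adjacency list.

    Edges are treated as undirected, so each ID lists the other as a neighbor.
    """
    adjacency = defaultdict(set)
    for a_id, b_id, _match_type, confidence in edge_rows:
        if confidence >= min_confidence:
            adjacency[a_id].add(b_id)
            adjacency[b_id].add(a_id)
    return {node: sorted(neighbors) for node, neighbors in adjacency.items()}


edge_rows = [
    ("email:h1", "cookie:c1", "deterministic", 1.0),
    ("cookie:c1", "device:d9", "probabilistic", 0.92),
    ("device:d9", "cookie:c7", "probabilistic", 0.55),
]
flattened = to_adjacency(edge_rows, min_confidence=0.9)
```

The flattened form answers "what is linked to this ID" with a single key lookup, while the original rows keep the per-edge match type and confidence that the flattening discards.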

Make the graph’s data available through a CDP interface
If you have an existing CDP, publish a nightly export in the format the CDP already uses for ingestion, such as CSV batches or S3 manifest files. For custom tool stacks, create REST or gRPC endpoints that take an identifier and return the combined data for that identifier. Enforce rate limiting to protect against overload or abuse, monitor latency so you know when performance degrades, and make sure the API gateway routes Data Subject Access Requests to the pipeline responsible for deleting or anonymizing user data.
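Rate limiting is usually handled by the API gateway, but the idea can be sketched with a token bucket, one common algorithm for it. The `TokenBucket` class and its parameters are hypothetical; the injectable clock exists only so the behavior can be demonstrated without sleeping.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


fake_now = [0.0]  # manually advanced clock for the demonstration
bucket = TokenBucket(rate_per_sec=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0
after_refill = bucket.allow()
```

The first two lookups pass, the third is rejected, and one more is allowed after a second of refill. In practice a limiter like this would be keyed per client or API key.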

Govern privacy and security
Encrypt data both in transit (TLS 1.2+) and at rest (AES-256). Apply role-based access control in the warehouse so that only people with approved roles can query personally identifiable information. Automate data retention with partition-level TTL jobs that purge records after country-specific thresholds expire. Maintain an audit trail of identity resolution rule changes; regulators often ask for logic transparency, not just consent records.
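A retention job of the kind described above boils down to comparing each record's age against a per-country limit. The thresholds in `RETENTION_DAYS` are placeholders invented for this sketch, not legal guidance, and the record fields are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Placeholder limits for illustration only; real values come from legal review.
RETENTION_DAYS = {"DE": 180, "US": 365}
DEFAULT_RETENTION_DAYS = 180


def is_expired(record, now):
    """True when the record's last_seen is older than its country's limit."""
    days = RETENTION_DAYS.get(record["country"], DEFAULT_RETENTION_DAYS)
    return now - record["last_seen"] > timedelta(days=days)


def purge(records, now=None):
    """Return only the records that are still within their retention window."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records if not is_expired(r, now)]


now = datetime(2025, 1, 30, tzinfo=timezone.utc)
records = [
    {"id": "a", "country": "DE", "last_seen": now - timedelta(days=200)},
    {"id": "b", "country": "US", "last_seen": now - timedelta(days=200)},
    {"id": "c", "country": "FR", "last_seen": now - timedelta(days=10)},
]
kept = purge(records, now=now)
```

Record `a` is dropped because 200 days exceeds the 180-day placeholder for its country, while `b` and `c` are kept. In a warehouse, the same rule would typically run as a scheduled job that drops whole partitions.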

Establish monitoring and feedback loops
Important performance indicators include match precision, match recall, profile depth, and query latency. Create a dashboard using these metrics and set alert thresholds to let you know if problems develop. Use data about which marketing efforts actually led to conversions, such as purchases or sign-ups, and feed it into the system that probabilistically matches and links customer identifiers to adjust and improve its weights.
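Match precision and recall can be computed from a labeled sample of identifier pairs. This sketch assumes a hand-verified "truth" set exists, which is itself an assumption; building that set is usually the hard part.

```python
def pairwise_metrics(predicted_pairs, true_pairs):
    """Return (precision, recall) over unordered identifier pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


predicted_pairs = [("a", "b"), ("c", "d"), ("e", "f")]
true_pairs = [("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")]
precision, recall = pairwise_metrics(predicted_pairs, true_pairs)
```

Two of the three predicted links are correct (precision of about 0.67), and two of the four true links were found (recall of 0.5). Using `frozenset` makes the pair order irrelevant, so ("a", "b") and ("b", "a") count as the same link.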

Periodically, A/B test the graph against a control group that uses legacy matching. This measures the improvement in matching accuracy or business outcomes achieved by the new graph and catches regressions early if changes to the graph reduce performance.

Decide on build vs. buy extensions
Open-source graph databases (e.g., Neo4j, JanusGraph) are more flexible, but you have to maintain them. Commercial CDPs may come with identity resolution built in, but you may not be able to see the underlying algorithms. A hybrid model, with deterministic logic coded in-house and probabilistic enrichment coming from a vendor, is often transparent enough while being faster than developing your own system. Document trade-offs so stakeholders understand the long-term cost of each path.


Potential pitfalls and how to prevent them

• Don’t depend too much on probabilistic links, because they could be wrong. If you use them, set conservative confidence thresholds as a safeguard, especially for first-party messaging such as email or text messages.

• Storing raw identifiers without hashing invites breach risks; use salted, one-way hashes to protect the data, and use plain text only when a downstream system requires it.

• Giving numerous microservices write access to the data can lead to inconsistencies and fragmentation. Instead, route all changes through one version-controlled interface.
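For the hashing safeguard above, one option is a keyed hash (HMAC-SHA256), where a secret key plays the role of the salt. The sketch below uses only Python's standard library; the function name is hypothetical, and the keys shown are placeholders that would come from a secrets manager in practice.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed one-way digest of a normalized identifier."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-key-not-for-production"    # placeholder only
other_key = b"a-different-placeholder-key"  # placeholder only

token_1 = pseudonymize("jane_doe@email.com", key)
token_2 = pseudonymize("jane_doe@email.com", key)
token_3 = pseudonymize("jane_doe@email.com", other_key)
```

The same identifier and key always produce the same token, so deterministic matching still works on the stored values. Without the key, an attacker cannot precompute digests of likely emails, and the output cannot be reversed to the raw value.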

Continuous improvement roadmap

Like so many things, setting up an identity graph isn’t one-and-done; you want to be able to improve it based on new information. Schedule quarterly schema reviews, revise resolution weights after major marketing campaigns, and benchmark warehouse performance as data volume grows. If identifier types such as third-party cookies are becoming obsolete, retire them on your own schedule before you’re forced to by a browser vendor. 

Ready to build a real-time identity graph?

Identity graphs only reach their full potential when every lookup is lightning-fast and every edge scales to billions without breaking the bank. That’s what Aerospike’s real-time NoSQL core, plus the new Aerospike Graph service, delivers: sub-millisecond queries, petabyte capacity, and strong consistency so you can merge or split identities with confidence.


Get started

Engineered for massive identity graphs, Aerospike lets you cut infrastructure costs by up to 80 percent while keeping latency under a millisecond, so every deterministic or probabilistic match you serve is fresh, accurate, and privacy-compliant.
