Entity resolution

How Atlas decides whether two scraped records refer to the same company — the deterministic ladder, the LLM-assisted disambiguation pass, and the operator-review safety net.

Inbound data arrives messy. The same company shows up as "Toray Industries", "TORAY INDUSTRIES, INC.", and "Toray Industries Co Ltd" across three different exhibitor lists. Entity resolution is how Atlas decides those three rows refer to the same canonical company.

This is also how every count Atlas publishes — exhibitor counts, churn rates, sponsor-tier counts — stays correct over time. Bad entity resolution silently inflates counts and masks churn. Atlas's resolution ladder is documented end-to-end below.

The resolution ladder

Atlas applies six tests in order, the most authoritative first. The first one to fire wins:

1. Domain match (highest authority)

Two records resolving to the same canonical domain (henkel.com, toray.com) are the same company. This rule is enforced as a database uniqueness constraint.

When records arrive without a domain (typical of scraped exhibitor lists from organizers who don't publish URLs), this rule can't fire and we fall through.

2. LinkedIn URL match

LinkedIn URLs are stable across years and present on most enriched records, so they form the next-most-authoritative match after domain.

3. Third-party enrichment ID match

When records carry a stable ID from a licensed B2B enrichment provider, records sharing the same ID are treated as the same company. We trust these IDs because the providers do their own entity resolution at scale.

4. Exact-name match after normalization

The fallback when neither domain nor LinkedIn nor enrichment ID is present. Atlas computes a normalized name by:

  • Lowercasing
  • Stripping legal-form suffixes (co, ltd, corp, corporation, inc, gmbh, sa, kg, kgaa, llc, plc, pte, sdn bhd, pvt, holdings, group, enterprise(s), incorporated, company)
  • Stripping punctuation (. , ' " & / \ ( ))
  • Collapsing whitespace
  • Stripping accents (NFKD normalization)

Two records that normalize to the same string merge. This layer resolves TORAY INDUSTRIES, Toray Industries Inc., and TORAY INDUSTRIES, INC. to one company without false positives.
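
The normalization steps above can be sketched in Python as follows. This is a minimal illustration, not Atlas's actual code: the function name `normalize_name` is hypothetical, and three choices are assumptions. Punctuation is replaced with a space and removed before suffix matching (so "INC." is seen as "inc"), only trailing tokens count as suffixes, and at least one token is always kept.

```python
import re
import unicodedata

# Legal-form suffixes from the list above ("enterprise(s)" expanded to both forms).
LEGAL_SUFFIXES = {
    "co", "ltd", "corp", "corporation", "inc", "gmbh", "sa", "kg", "kgaa",
    "llc", "plc", "pte", "pvt", "holdings", "group", "enterprise",
    "enterprises", "incorporated", "company",
}
# Exactly the punctuation characters listed above: . , ' " & / \ ( )
PUNCTUATION = re.compile(r"[.,'\"&/\\()]")


def strip_legal_suffixes(tokens: list[str]) -> list[str]:
    """Drop trailing legal-form tokens, always keeping at least one token."""
    tokens = list(tokens)
    while len(tokens) > 1:
        if tokens[-1] in LEGAL_SUFFIXES:
            tokens.pop()
        elif len(tokens) > 2 and tuple(tokens[-2:]) == ("sdn", "bhd"):
            del tokens[-2:]  # two-word suffix "sdn bhd"
        else:
            break
    return tokens


def normalize_name(raw: str) -> str:
    # Strip accents: NFKD splits "é" into "e" plus a combining mark, which is dropped.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then turn listed punctuation into spaces (assumption: space, not deletion).
    text = PUNCTUATION.sub(" ", no_accents.lower())
    # split() collapses runs of whitespace; suffixes are then stripped from the end.
    return " ".join(strip_legal_suffixes(text.split()))
```

Under these assumptions, "Toray Industries", "TORAY INDUSTRIES, INC." and "Toray Industries Co Ltd" all reduce to the same key. Hyphens are not in the punctuation list, so a name like ENGINEERING-PLASTICS keeps its hyphen and does not collide with ENGINEERING PLASTICS at this layer.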

5. LLM-assisted disambiguation (the borderline band)

Rule 4 is exact-match-after-normalization. Rule 5 handles the band where names are close but not identical: MITSUBISHI ENGINEERING-PLASTICS COPR (typo) vs MITSUBISHI ENGINEERING PLASTICS CORP. Trigram similarity flags pairs in the 0.6–0.85 window — high enough to suspect a match, low enough that simple normalization can't confirm.

A borderline pair is reviewed by an LLM (Claude) with structured output asking: are these the same entity, with what confidence, and what's the reasoning? A pair merges only when the answer is yes with confidence ≥ 0.8.

Every LLM decision is stamped to an audit log alongside the reasoning so a merge can be reversed if needed.
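
The two gates in this rule (the trigram band and the confidence threshold) can be sketched as below. The similarity measure here is an assumption: a Jaccard score over padded character trigrams of the whole string, which may differ from the measure Atlas uses. `LlmVerdict` is a hypothetical stand-in for the structured output; the model call itself is deliberately left out.

```python
from dataclasses import dataclass

BAND_LOW, BAND_HIGH = 0.6, 0.85   # borderline window from the text
MIN_CONFIDENCE = 0.8              # merge threshold from the text


def trigrams(s: str) -> set[str]:
    # Pad so that leading and trailing characters contribute trigrams too.
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets (assumed measure)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def in_borderline_band(a: str, b: str) -> bool:
    return BAND_LOW <= trigram_similarity(a, b) <= BAND_HIGH


@dataclass(frozen=True)
class LlmVerdict:
    """Hypothetical shape of the structured answer for one pair."""
    same_entity: bool
    confidence: float
    reasoning: str  # kept so the audit log can store it with the decision


def should_merge(verdict: LlmVerdict) -> bool:
    # Merge only on a "yes" at or above the confidence threshold.
    return verdict.same_entity and verdict.confidence >= MIN_CONFIDENCE
```

With this particular measure, the MITSUBISHI typo pair from the example lands inside the band, so it would be sent for review rather than merged or discarded outright. A "yes" at confidence 0.79 still does not merge.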

6. Operator review (manual)

Pairs the automated layers can't resolve land in an operator merge queue. The operator sees both records side by side and can merge them with a click; each action is recorded in the admin audit log.
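
The "first one to fire wins" ordering of the deterministic rules (1 to 4) can be sketched as an ordered list of tests. The `Record` fields and the function name are hypothetical. The sketch only models positive matches, because the text above does not say how a conflicting higher-authority signal (for example, two different domains) is handled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Record:
    name_key: str                        # normalized name, as in rule 4
    domain: Optional[str] = None         # canonical domain, if known
    linkedin_url: Optional[str] = None
    enrichment_id: Optional[str] = None  # provider ID, if enriched


def _same(a: Optional[str], b: Optional[str]) -> bool:
    # A missing value never matches: the rule cannot fire and we fall through.
    return a is not None and a == b


# Ordered from most to least authoritative.
RULES: list[tuple[str, Callable[[Record, Record], bool]]] = [
    ("domain", lambda a, b: _same(a.domain, b.domain)),
    ("linkedin_url", lambda a, b: _same(a.linkedin_url, b.linkedin_url)),
    ("enrichment_id", lambda a, b: _same(a.enrichment_id, b.enrichment_id)),
    ("normalized_name", lambda a, b: a.name_key == b.name_key),
]


def first_matching_rule(a: Record, b: Record) -> Optional[str]:
    """Return the name of the first rule that fires, or None."""
    for name, test in RULES:
        if test(a, b):
            return name
    return None  # candidate for rule 5 (borderline band) or rule 6 (operator)
```

Returning the rule name rather than a bare boolean makes it possible to record which layer produced each merge.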

What gets merged when two rows resolve to the same entity

The canonical pick is the row with the higher completeness score — a weighted sum of how much we know about each candidate (presence of enrichment payload, presence of a domain, presence of an HQ country / city, contact count, presence of structured industry / size data).

Ties break on contact count, then earliest record creation, then smallest internal ID. Because every step is deterministic, the canonical pick is stable across runs.
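
The canonical pick and its tie-breaks can be expressed as a single sort key. The weights below are invented for illustration (the actual weights are not stated here), the field names are hypothetical, and the sketch assumes that a higher contact count wins the first tie-break.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    id: int
    created_at: datetime
    contact_count: int = 0
    has_enrichment: bool = False
    domain: Optional[str] = None
    hq_country: Optional[str] = None
    has_industry_data: bool = False


def completeness(c: Candidate) -> int:
    # Hypothetical integer weights; integers keep tie detection exact.
    return (
        30 * c.has_enrichment
        + 20 * (c.domain is not None)
        + 10 * (c.hq_country is not None)
        + 10 * c.has_industry_data
        + min(c.contact_count, 50)   # capped so contacts cannot dominate
    )


def pick_canonical(candidates: list[Candidate]) -> Candidate:
    # Highest completeness, then most contacts, then earliest creation,
    # then smallest internal ID. The key is total, so the result is stable.
    return min(
        candidates,
        key=lambda c: (-completeness(c), -c.contact_count, c.created_at, c.id),
    )
```

Because the key ends with the unique internal ID, two candidates can never compare equal, which is what keeps the pick independent of input order.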

The merge action:

  1. Persons whose employer was the sibling now point to the canonical company
  2. The sibling row is preserved as a subsidiary-shaped record (so we don't lose its history)
  3. Both rows get a merge stamp with the run timestamp + IDs for reversibility
  4. Idempotent: re-running the pass skips groups whose canonical is already the parent
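
A minimal in-memory sketch of the merge action follows. The class and field names are hypothetical, the real system presumably does this in a database transaction, and the exact contents of the merge stamp (here: timestamp, both company IDs, and the moved person IDs) are an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Company:
    id: int
    parent_id: Optional[int] = None          # set on the sibling after a merge
    merge_stamps: list[dict] = field(default_factory=list)


@dataclass
class Person:
    id: int
    employer_id: int


def merge(canonical: Company, sibling: Company,
          persons: list[Person], run_at: datetime) -> bool:
    """Merge sibling into canonical. Returns False if already merged."""
    # Idempotency: skip when the canonical is already the sibling's parent.
    if sibling.parent_id == canonical.id:
        return False

    # 1. Repoint persons employed by the sibling to the canonical company.
    moved = []
    for person in persons:
        if person.employer_id == sibling.id:
            person.employer_id = canonical.id
            moved.append(person.id)

    # 2. Keep the sibling row, shaped as a subsidiary of the canonical.
    sibling.parent_id = canonical.id

    # 3. Stamp both rows with enough information to reverse the merge.
    stamp = {
        "run_at": run_at.isoformat(),
        "canonical_id": canonical.id,
        "sibling_id": sibling.id,
        "moved_person_ids": moved,
    }
    canonical.merge_stamps.append(dict(stamp))
    sibling.merge_stamps.append(dict(stamp))
    return True
```

Running the same merge twice leaves a single stamp on each row, since the second call returns early.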

Why we don't auto-merge across countries by default

A naive entity-resolution system would merge Toray Industries (Japan) with Toray Carbon Fibers Europe (France) because the names overlap heavily. Atlas does not. Toray Carbon Fibers Europe is a distinct exhibiting unit — it has its own booth presence, its own contact set, its own sales territory. The resolution ladder treats them as separate canonical entities until a higher-authority signal (same domain, same LinkedIn URL) says otherwise.

This is by design. The cost of a false-positive merge (wrong contacts on the wrong record, lost subsidiaries) is much higher than the cost of a false negative (two rows that could in principle be merged stay separate).

Cite this

Atlas Methodology, "Entity resolution", /methodology/entity-resolution.
