Introduction to Union Find

Ten million friendship events stream in, one pair at a time: “Alice and Bob just became friends.” Interleaved with them are millions of queries: “Are Carol and Dave in the same friend circle right now?” You must answer each query in effectively constant time — while the circles keep merging underneath you.

Rebuild a graph and run BFS for every query? That’s O(V + E) per query — quadratic over the whole stream, hopelessly slow. Keep a group_id on each person and rename a whole circle on every merge? Also quadratic. The data structure on this page answers both the merges and the queries in essentially O(1) each. Let’s build it.

A disjoint-set data structure maintains a collection of non-overlapping sets. Each element belongs to exactly one set. The structure supports two operations:

find(x) — return a representative of the set containing x. Two elements are in the same set if and only if their find returns the same value.
union(x, y) — merge the sets containing x and y into a single set.

That’s the entire API. Two operations. Everything else is a clever choice of representation.

The partition view

Think of the universe as a bag of n items, initially each in its own group:

{1}  {2}  {3}  {4}  {5}  {6}  {7}  {8}

As events arrive (edges in a graph, friend pairings, equality assertions), you union groups together:

union(1, 2):  {1, 2}  {3}  {4}  {5}  {6}  {7}  {8}
union(3, 4):  {1, 2}  {3, 4}  {5}  {6}  {7}  {8}
union(1, 3):  {1, 2, 3, 4}  {5}  {6}  {7}  {8}
union(5, 6):  {1, 2, 3, 4}  {5, 6}  {7}  {8}

At any point you can ask “are x and y in the same group?” — that’s find(x) == find(y).

A DSU is a partition tracker. Items only get merged, never split. (There exist structures that support split too — link-cut trees — but they’re far more complex and rarely interview material.)

Why not just use a hash map?

Naive idea: maintain a dict group_id[x] = id. To union(x, y), rename every element of y’s group to use x’s id.

That’s O(group size) per union. For n operations, the total cost is O(n²) in the worst case (a long chain of single-element unions). For small inputs it’s fine; for any serious workload it falls over.

DSU’s brilliance is using a tree instead of a flat dict — and choosing the tree structure carefully so that both operations stay fast.

The forest representation

Represent each set as a tree:

Every element points to a parent.
The root of a tree is the set’s representative (its parent is itself).
find(x) walks parent pointers up to the root.
union(x, y) makes one root the parent of the other.

parent: [0, 0, 1, 1, 4, 4, 6, 6]
            ↑   ↑      ↑      ↑
        root=0  child=2     root=4 (lone tree)

Trees:
    0           4         6
   / \         / \         \
  1   ...     5  ...        7
 /
2
 \
  3

Note: parent arrays are the most common encoding. parent[i] = “who’s i’s parent?” Initially parent[i] = i for every i (each element is its own root, in its own singleton set).

The trees here have nothing to do with binary search trees. They’re not sorted; they don’t represent anything semantic. They’re a structural choice that makes the partition-tracking operations cheap. Children of the same parent are the elements that happened to be unioned with it — order is incidental.

What operations should be fast?

You’ll typically run a mix of operations:

Predict first

Between find and union, which one runs MORE often in a typical DSU workload — and why does that decide which one we optimize hardest?

Operation	How often you do it
`union(x, y)`	Once per “merge” event
`find(x)`	Twice per union, plus standalone queries
Count components	At the end, or after each event

find runs more than union — every union calls find twice (once for each endpoint). Making find fast is the priority.

Naive find walks parent pointers up to the root. If the tree is a long chain (worst case after a bad sequence of unions), that’s O(n) per call. Add n operations, and you’re back to O(n²).

The two optimizations we’ll see in the next two pages — union by rank and path compression — together flatten the tree so aggressively that the amortized cost per operation is O(α(n)), where α is the inverse Ackermann function. For any n smaller than the number of atoms in the universe, α(n) ≤ 5. It’s effectively constant.

A pattern-spotting checklist

Is the problem about groups / equivalence classes?

“Are X and Y in the same group?” “How many groups are there?” “Does adding this connection merge two groups?” Strong DSU signals.

Do I only need to ADD edges, never remove them?

DSU is great at edge insertion (union). It’s bad at edge deletion (the partition can’t split easily). If the problem allows or requires deletions, you may need offline trick (process queries in reverse) or a different structure.

Do I need actual paths, or just connectivity?

DSU tells you “yes, connected” or “no, not.” It does NOT give you a path. If the question needs an explicit path (shortest, all, etc.), use BFS/DFS instead.

Is the input arriving as a stream of pairs?

Edges in a graph. Equality assertions in a constraint problem. Friendship events. Whenever pairs of items keep arriving and you need to track which clump they form, DSU is the answer.

Could the problem be re-framed as cycle detection?

“Is this graph a tree?” “Does adding this edge create a cycle?” “Are there duplicates in this constraint system?” All map onto DSU’s “the two endpoints are already in the same component” check.

Quick check

What does find(x) return in a disjoint-set data structure?

Why can't we naively use a hash map from element → group_id?

A DSU is great at adding edges (unions). When is it the wrong tool?

This is the first non-traversal way we’ve tracked graph connectivity: where Day 10’s BFS/DFS walk a graph to discover components, DSU maintains them incrementally as edges arrive — which is exactly what the streaming hook above demands. Next: the DSU template — the naive parent-array code, and the degenerate input that makes it dangerously slow.

Overview The DSU Template

Finished this page?