Record linkage demo

Record linkage with active learning

Deduplicate a file or link two files with far fewer labels than random sampling requires. Instead of labeling random pairs, you label the pairs the system selects as most informative for the matcher, so match probabilities become reliable quickly.

Early on you’ll see sim (raw similarity). After a few labels, you’ll see prob (calibrated match probability) and the queried pairs concentrate near p≈0.5.

Upload CSV(s)
Label 10–30 pairs
Run full scoring + download results

How it works

Compute many string-distance similarity features from record text (e.g., a text/name column).
Compress features with PCA into a small embedding (principal components).
Fit ridge logistic regression on PCs to predict match probability.
Actively choose the next pairs to label to reduce uncertainty fastest (often near p≈0.5).
Query strategy: start with diverse pairs, then pick batches that maximize expected information gain (entropy reduction).
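The first and last steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the backend's code: the three features are an assumed, illustrative set, and ranking by predictive entropy is a simpler stand-in for the expected-information-gain batch selection described above. The PCA and regression steps are skipped here; `probs` is assumed to come from an already fitted matcher.

```python
import math
from difflib import SequenceMatcher


def similarity_features(a: str, b: str) -> list[float]:
    """A few string-similarity features for one candidate pair (illustrative set)."""
    a, b = a.lower().strip(), b.lower().strip()
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    jaccard = len(tokens_a & tokens_b) / len(union) if union else 1.0
    ratio = SequenceMatcher(None, a, b).ratio()  # character-level similarity in [0, 1]
    longest = max(len(a), len(b))
    length_sim = min(len(a), len(b)) / longest if longest else 1.0
    return [ratio, jaccard, length_sim]


def binary_entropy(p: float) -> float:
    """Entropy in bits of a match probability; largest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def next_batch(probs: dict, batch_size: int) -> list:
    """Return the unlabeled pairs whose predicted probability is most uncertain."""
    ranked = sorted(probs, key=lambda pair: binary_entropy(probs[pair]), reverse=True)
    return ranked[:batch_size]
```

Because entropy peaks at p = 0.5, a ranking like this naturally pulls queried pairs toward the middle of the probability range.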

As labels accumulate, the backend can increase capacity by using more PCs.
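To make the idea of "capacity" concrete, here is a hedged sketch of ridge (L2-penalized) logistic regression fitted on only the first k principal-component scores. It uses plain gradient descent in pure Python for readability. The function names, the hyperparameters, and the `components_for` schedule are all hypothetical; the page does not say how the backend chooses the number of PCs.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(X, y, k, l2=1.0, lr=0.5, steps=500):
    """Fit L2-penalized logistic regression on the first k columns of X.

    X: list of rows of component scores, y: list of 0/1 labels.
    Returns (weights, bias); the bias is not penalized.
    """
    n = len(X)
    w, b = [0.0] * k, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * k, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row[:k]))) - label
            grad_b += err
            for j in range(k):
                grad_w[j] += err * row[j]
        # Gradient step on the mean log-loss plus the ridge penalty on w.
        w = [wj - lr * (gj / n + l2 * wj / n) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(row, w, b):
    # zip stops at len(w), so only the first k components are used.
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))


def components_for(n_labels, max_pcs, labels_per_pc=5):
    # Hypothetical schedule: allow one more component per `labels_per_pc` labels.
    return max(1, min(max_pcs, n_labels // labels_per_pc))
```

With few labels a small k keeps the model from overfitting; as labels accumulate, a larger k lets it use more of the feature variance.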

Hosted API demo

This page calls a hosted backend API. Uploads are processed for your session and used only to fit the in-session matcher (not for training outside your run). Data may be cached temporarily to generate outputs and is not intended to be retained as a dataset.

The backend is designed to run in a small, CPU-only compute footprint.

If you need a private deployment or a higher-throughput trial environment, email joe@josephsmiller.com.

Upload

API token (optional)

If the server requires a token, or if you want a hosted trial, request a demo token by email.

Left CSV (required)
Right CSV (optional)

Sample dataset: download the sample left CSV and the sample right CSV.

CSV format (hosted demo): the backend uses a text column if present, else name, else the first column, and ignores other columns. Tip: concatenate multiple fields into a single text column (e.g., name + address) for better matching.
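The column fallback and the concatenation tip can be sketched as follows. This is an illustration of the rule as stated, not the backend's code: the function names are hypothetical, exact lowercase header matching is an assumption, and `add_text_column` assumes the input does not already have a `text` column.

```python
import csv
import io


def pick_text_column(fieldnames):
    """Mirror the documented fallback: `text`, else `name`, else the first column."""
    for candidate in ("text", "name"):
        if candidate in fieldnames:
            return candidate
    return fieldnames[0]


def add_text_column(csv_text, fields, sep=" "):
    """Return CSV text with a leading `text` column built by joining `fields`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["text"] + list(reader.fieldnames), lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Skip empty or missing values so the separator is not doubled.
        row["text"] = sep.join(row[f].strip() for f in fields if row[f])
        writer.writerow(row)
    return out.getvalue()
```

For example, calling `add_text_column(raw, ["name", "address"])` before upload produces a file whose `text` column is the one the hosted backend would compare.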

Sample datasets are optional; you can upload your own CSVs instead (subject to demo limits enforced by the backend).

Label

Batch size
Auto-suggest threshold (prob)

Labels are auto-suggested only for pairs with high-confidence probabilities. Suggested labels stay highlighted until you review the pair; uncertain pairs are left blank.
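One plausible reading of the auto-suggest rule is sketched below. The symmetric cutoff (suggest "no-match" below 1 - threshold) and the default of 0.9 are assumptions; the page only says that suggestions apply to high-confidence probabilities.

```python
def suggest_label(prob, threshold=0.9):
    """Suggest a label only when the probability is confidently high or low."""
    if prob >= threshold:
        return "match"
    if prob <= 1.0 - threshold:
        return "no-match"
    return None  # uncertain: leave blank for the reviewer
```

Raising the threshold produces fewer suggestions and leaves more pairs blank for manual review.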

Score shown

Pool remaining

PCs used

Variance explained

Compared fields: this hosted backend compares the text/name column (or first column). Other deployments may compare multiple columns field-by-field.

Keyboard shortcuts

j/k or ↑/↓: move • a: match • s: skip • d: no-match • x: clear • Enter: submit • Shift+N: next batch

Results

Predictions = scored candidate pairs (with probabilities once you’ve labeled a few pairs). Matches = linkage-mode pairs above the backend match threshold. Clusters are generated only for dedup mode (linkage mode returns matches instead).

predictions: review scores and choose a cutoff for your workflow
matches: linkage-mode “accept list” (mutual best matches above the backend threshold)
clusters: dedup-mode group IDs produced from clustering the scored similarities
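The two mode-specific outputs can be illustrated with a short sketch. `mutual_best_matches` follows the description above directly. `cluster_ids` assumes connected components over pairs at or above a threshold, which is only one possible clustering; the page does not name the method the backend uses. All function names are hypothetical.

```python
def mutual_best_matches(scores, threshold):
    """scores: {(left_id, right_id): prob}.

    Keep pairs that are each other's best-scoring partner and score >= threshold.
    """
    best_for_left, best_for_right = {}, {}
    for (left, right), p in scores.items():
        if left not in best_for_left or p > scores[(left, best_for_left[left])]:
            best_for_left[left] = right
        if right not in best_for_right or p > scores[(best_for_right[right], right)]:
            best_for_right[right] = left
    return sorted(
        (left, right)
        for left, right in best_for_left.items()
        if best_for_right[right] == left and scores[(left, right)] >= threshold
    )


def cluster_ids(n_records, scores, threshold):
    """Dedup mode: group IDs from connected components (union-find)."""
    parent = list(range(n_records))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), p in scores.items():
        if p >= threshold:
            parent[find(i)] = find(j)
    roots = {}
    # Number the groups in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(n_records)]
```

Requiring mutual best matches keeps the accept list one-to-one: a right-side record that is the top choice of two left-side records is assigned only to the one it scores highest with.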

Match labels

No-match labels

Metrics update as you label batches. Use “Run full scoring” to generate downloadable outputs.

Have questions, or want a private deployment or a higher-throughput environment? Email me.