Record linkage with active learning
Deduplicate a file or link two files with far fewer labels. Instead of labeling random pairs, the system actively asks for the next pair that will teach the matcher the most, so you reach high-quality match probabilities quickly.
Early on you’ll see sim (raw similarity). After a few labels, you’ll see prob (calibrated match probability), and the queried pairs concentrate near p≈0.5.
- Upload CSV(s)
- Label 10–30 pairs
- Run full scoring + download results
How it works
- Compute many string-distance similarity features from record text (e.g., a text/name column).
- Compress features with PCA into a small embedding (principal components).
- Fit ridge logistic regression on PCs to predict match probability.
- Actively choose the next pairs to label to reduce uncertainty fastest (often near p≈0.5).
- Query strategy: start with diverse pairs, then pick batches that maximize expected information gain (entropy reduction).
As labels accumulate, the backend can increase capacity by using more PCs.
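As a rough illustration of the steps above, the following Python sketch builds two string-similarity features, fits an L2-penalized (ridge) logistic regression by gradient descent, and ranks unlabeled pairs by binary entropy so that those nearest p≈0.5 come first. It is not the backend's code: the feature set, function names, and hyperparameters are assumptions, the PCA step is omitted to keep the example in the standard library, and ranking by entropy is a simpler stand-in for batch expected information gain.

```python
import math
from difflib import SequenceMatcher


def similarity_features(a, b):
    """Two simple similarity features in [0, 1] (hypothetical stand-ins for the real feature set)."""
    a, b = a.lower().strip(), b.lower().strip()
    char_ratio = SequenceMatcher(None, a, b).ratio()
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    jaccard = len(tokens_a & tokens_b) / len(union) if union else 1.0
    return [char_ratio, jaccard]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, bias, x):
    """Match probability for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + bias)


def fit_ridge_logistic(X, y, l2=0.1, lr=0.5, steps=2000):
    """L2-penalized logistic regression fitted by plain gradient descent (bias is not penalized)."""
    n, d = len(X), len(X[0])
    w, bias = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, bias, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        bias -= lr * grad_b / n
    return w, bias


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) label; largest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def most_uncertain(probs, k):
    """Indices of the k probabilities with the highest entropy."""
    order = sorted(range(len(probs)), key=lambda i: binary_entropy(probs[i]), reverse=True)
    return order[:k]


# Toy labeled pairs (made up for illustration): 1 = match, 0 = no-match.
labeled = [
    ("Jon Smith", "John Smith", 1),
    ("Acme Corp", "ACME Corporation", 1),
    ("Jon Smith", "Mary Jones", 0),
    ("Acme Corp", "Globex Ltd", 0),
]
X = [similarity_features(a, b) for a, b, _ in labeled]
y = [label for _, _, label in labeled]
w, bias = fit_ridge_logistic(X, y)

unlabeled = [("Jon Smyth", "John Smith"), ("Acme Corp", "Initech"), ("Globex", "Globex Ltd")]
probs = [predict(w, bias, similarity_features(a, b)) for a, b in unlabeled]
query_order = most_uncertain(probs, k=2)  # indices of the next pairs to show the labeler
```

In a loop, each newly labeled batch would be appended to `labeled` and the model refitted before the next query.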
This page calls a hosted backend API. Uploads are processed for your session and used only to fit the in-session matcher (not for training outside your run). Data may be cached temporarily to generate outputs and is not intended to be retained as a dataset.
The backend is designed to run CPU-only in a small compute footprint.
If you need a private deployment or a higher-throughput trial environment, email joe@josephsmiller.com.
Upload
For a hosted trial, request a demo token: email me.
CSV format (hosted demo): the backend uses a text column if present, else name, else the first column, and ignores other columns.
Tip: concatenate multiple fields into a single text column (e.g., name + address) for better matching.
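The column rule and the concatenation tip can be sketched in Python as follows. This is a minimal illustration, not the backend's code: the function names are hypothetical, and exact, case-sensitive header matching is an assumption.

```python
import csv
import io


def pick_text_column(fieldnames):
    """Mirror the stated rule: 'text' if present, else 'name', else the first column."""
    for preferred in ("text", "name"):
        if preferred in fieldnames:
            return preferred
    return fieldnames[0]


def with_text_column(csv_text, fields, sep=" "):
    """Return CSV text with a leading 'text' column built by joining the given fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if "text" in reader.fieldnames:
        raise ValueError("input already has a 'text' column")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["text"] + list(reader.fieldnames), lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Skip empty values so missing fields do not leave stray separators.
        parts = [(row.get(f) or "").strip() for f in fields]
        row["text"] = sep.join(p for p in parts if p)
        writer.writerow(row)
    return out.getvalue()


prepared = with_text_column("name,address\nAnn Lee,1 Main St\n", ["name", "address"])
```

Running a file through `with_text_column` before upload means the combined field is the one compared, since a text column takes priority.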
Sample datasets are optional; you can upload your own CSVs instead (subject to demo limits enforced by the backend).
Label
Suggested labels appear only for pairs with high-confidence probabilities. They stay highlighted until you review the pair; uncertain pairs are left blank.
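One way to picture the suggestion rule is a pair of probability cutoffs. The sketch below is an assumption for illustration only: the function name and the 0.9 / 0.1 cutoffs are hypothetical, and the backend's actual thresholds are not stated here.

```python
def suggest_label(prob, hi=0.9, lo=0.1):
    """Return a suggested label for confident probabilities, or None to leave the pair blank."""
    if prob >= hi:
        return "match"
    if prob <= lo:
        return "no-match"
    return None  # uncertain: no suggestion, the reviewer decides


suggestions = [suggest_label(p) for p in (0.97, 0.55, 0.03)]
```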
Compared fields: this hosted backend compares the text/name column (or first column). Other
deployments may compare multiple columns field-by-field.
Keyboard shortcuts
j/k or ↑/↓: move • a: match • s: skip • d: no-match • x: clear • Enter: submit • Shift+N: next batch
Results
Predictions = scored candidate pairs (with probabilities once you’ve labeled a few pairs). Matches = linkage-mode pairs above the backend match threshold. Clusters are generated only for dedup mode (linkage mode returns matches instead).
- predictions: review scores and choose a cutoff for your workflow
- matches: linkage-mode “accept list” (mutual best matches above the backend threshold)
- clusters: dedup-mode group IDs produced from clustering the scored similarities
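To make the two output types concrete, here is a small Python sketch of how they could be derived from scored pairs. It is an assumption-laden illustration: the function names and the thresholds used are hypothetical, and connected components via union-find is only one possible clustering method. The backend's own clustering method is not specified here.

```python
def mutual_best_matches(scored_pairs, threshold):
    """Linkage mode: keep (left, right, prob) where each side is the other's top-scoring partner."""
    best_left, best_right = {}, {}
    for left, right, prob in scored_pairs:
        if left not in best_left or prob > best_left[left][1]:
            best_left[left] = (right, prob)
        if right not in best_right or prob > best_right[right][1]:
            best_right[right] = (left, prob)
    return sorted(
        (left, right, prob)
        for left, (right, prob) in best_left.items()
        if prob >= threshold and best_right[right][0] == left
    )


def cluster_ids(record_ids, scored_pairs, threshold):
    """Dedup mode: group IDs from connected components over pairs scoring at or above the threshold."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, prob in scored_pairs:
        if prob >= threshold:
            parent[find(a)] = find(b)

    labels = {}
    return {rid: labels.setdefault(find(rid), len(labels)) for rid in record_ids}


links = mutual_best_matches(
    [("a1", "b1", 0.95), ("a1", "b2", 0.60), ("a2", "b1", 0.90), ("a2", "b2", 0.40)],
    threshold=0.8,
)
groups = cluster_ids([1, 2, 3, 4], [(1, 2, 0.9), (2, 3, 0.85), (3, 4, 0.2)], threshold=0.8)
```

Connected components will chain records together through intermediate links (1–2 and 2–3 put 1 and 3 in one group), which is worth keeping in mind when choosing a cutoff from the predictions file.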
Metrics update as you label batches. Use “Run full scoring” to generate downloadable outputs.
Questions or want a private deployment / higher-throughput environment? Email me.