Record linkage demo

Record linkage with active learning

Deduplicate a file or link two files with far fewer labels than random sampling needs. Instead of having you label random pairs, the system asks for the next pair that will teach the matcher the most, so you reach high-quality match probabilities quickly.

Early on you’ll see sim (raw similarity). After a few labels, you’ll see prob (calibrated match probability) and the queried pairs concentrate near p≈0.5.
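The idea of querying pairs near p≈0.5 is commonly called uncertainty sampling. The sketch below is only an illustration of that idea, not the backend's actual code: the function name `most_uncertain`, the batch size, and the example pairs and probabilities are all assumptions made up for this example.

```python
def most_uncertain(pairs, probs, batch_size=5):
    """Return the batch_size pairs whose match probability is closest to 0.5."""
    # Rank by distance from 0.5: the smaller the distance, the less certain the matcher.
    ranked = sorted(zip(pairs, probs), key=lambda item: abs(item[1] - 0.5))
    return [pair for pair, _ in ranked[:batch_size]]


# Hypothetical candidate pairs with hypothetical calibrated probabilities.
candidates = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
probabilities = [0.97, 0.52, 0.08, 0.41]

next_batch = most_uncertain(candidates, probabilities, batch_size=2)
print(next_batch)
```

Pairs the matcher already scores near 0 or 1 are skipped, which is why labeling effort goes to the borderline cases.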

  1. Upload CSV(s)
  2. Label 10–30 pairs
  3. Run full scoring + download results

How it works

As labels accumulate, the backend can increase model capacity by using more principal components (PCs).
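As a rough illustration of how the number of PCs and the variance they explain relate, here is a minimal sketch. It assumes that "PCs" means principal components and that capacity grows with the label count; the schedule in `pcs_for_labels` (one extra PC per five labels) and the eigenvalues are hypothetical, and the page does not describe the backend's real rule.

```python
def variance_explained(eigenvalues, k):
    """Fraction of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total > 0 else 0.0


def pcs_for_labels(n_labels, max_pcs, labels_per_pc=5):
    """Hypothetical schedule: allow one more PC for every labels_per_pc labels."""
    return max(1, min(max_pcs, n_labels // labels_per_pc))


# Hypothetical eigenvalues of the feature covariance matrix.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

k_early = pcs_for_labels(n_labels=6, max_pcs=len(eigenvalues))
k_late = pcs_for_labels(n_labels=20, max_pcs=len(eigenvalues))
print(k_early, variance_explained(eigenvalues, k_early))
print(k_late, variance_explained(eigenvalues, k_late))
```

Tying capacity to the label count is one way to keep a model from overfitting while only a handful of labels exist.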

Hosted API demo

This page calls a hosted backend API. Uploads are processed for your session and used only to fit the in-session matcher (not for training outside your run). Data may be cached temporarily to generate outputs and is not intended to be retained as a dataset.

The backend is efficient and designed to run in a tiny compute footprint (CPU-only).

If you need a private deployment or a higher-throughput trial environment, email joe@josephsmiller.com.

Upload

For a hosted trial, request a demo token: email me.

Download sample left CSV • Download sample right CSV

CSV format (hosted demo): the backend uses a text column if present, else name, else the first column, and ignores other columns. Tip: concatenate multiple fields into a single text column (e.g., name + address) for better matching.
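The column rule and the concatenation tip can be sketched with Python's standard `csv` module. This is an illustration of the rule as stated here, not the backend's code: `pick_match_column` and `add_text_column` are hypothetical helper names, the sample rows are invented, and exact-case header matching is an assumption.

```python
import csv
import io


def pick_match_column(fieldnames):
    """Mirror the stated rule: 'text' if present, else 'name', else the first column."""
    for preferred in ("text", "name"):
        if preferred in fieldnames:
            return preferred
    return fieldnames[0]


def add_text_column(rows, fields):
    """Concatenate the given fields into a single 'text' column on each row."""
    return [dict(row, text=" ".join(row[f].strip() for f in fields)) for row in rows]


# Invented sample data standing in for an uploaded CSV.
sample = "id,name,address\n1,Ann Lee,12 Oak St\n2,Bob Ray,9 Elm Ave\n"
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

column_before = pick_match_column(reader.fieldnames)   # no 'text' column yet
rows = add_text_column(rows, ["name", "address"])
column_after = pick_match_column(list(rows[0].keys()))
print(column_before, column_after, rows[0]["text"])
```

Preparing the text column before upload means the address contributes to matching instead of being ignored.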

Sample datasets are optional — you can upload your own CSVs instead (subject to demo limits enforced by the backend).

Label

Labels are suggested only for pairs with high-confidence probabilities. Suggested labels stay highlighted until you review the pair; uncertain pairs are left blank.
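A minimal sketch of this suggestion rule follows. The cutoffs 0.95 and 0.05 and the name `suggest_label` are assumptions chosen for illustration; the page does not state the thresholds the backend uses.

```python
def suggest_label(prob, high=0.95, low=0.05):
    """Suggest a label only when the probability is confidently high or low."""
    if prob >= high:
        return "match"
    if prob <= low:
        return "no-match"
    return None  # uncertain: leave the pair blank for the reviewer


# Hypothetical probabilities for three pairs.
suggestions = [suggest_label(p) for p in (0.99, 0.50, 0.02)]
print(suggestions)
```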

Score shown: -- • Pool remaining: -- • PCs used: -- • Variance explained: --

Compared fields: this hosted backend compares the text/name column (or first column). Other deployments may compare multiple columns field-by-field.

Keyboard shortcuts

j/k or ↑/↓: move • a: match • s: skip • d: no-match • x: clear • Enter: submit • Shift+N: next batch

Results

Predictions = scored candidate pairs (with probabilities once you’ve labeled a few pairs). Matches = linkage-mode pairs above the backend match threshold. Clusters are generated only for dedup mode (linkage mode returns matches instead).
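To make the difference between the outputs concrete, here is a hedged sketch. It assumes matches are pairs at or above a threshold and that dedup clusters are the connected components of those matched pairs; the page confirms neither the clustering method nor the threshold, so the union-find approach, the 0.85 cutoff, and the record IDs are all illustrative.

```python
def matches_above(scored_pairs, threshold):
    """Keep the pairs whose probability reaches the match threshold."""
    return [(a, b) for a, b, p in scored_pairs if p >= threshold]


def cluster(pairs):
    """Group records into connected components with a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical predictions: (record, record, match probability).
scored = [("r1", "r2", 0.96), ("r2", "r3", 0.91), ("r4", "r5", 0.40), ("r6", "r7", 0.88)]

matches = matches_above(scored, threshold=0.85)
clusters = cluster(matches)
print(matches)
print(clusters)
```

Here r1 and r3 land in the same cluster through r2 even though that pair was never scored directly, which is the practical difference between a match list and dedup clusters.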

Match labels: 0 • No-match labels: 0

Metrics update as you label batches. Use “Run full scoring” to generate downloadable outputs.

Questions or want a private deployment / higher-throughput environment? Email me.