Record linkage demo

Record linkage with active learning

Deduplicate a file or link two files with far fewer labels. Instead of having you label random pairs, the system actively asks for the next pair that will teach the matcher the most, so you reach high-quality match probabilities quickly.

Early on you’ll see sim (raw schema-aware similarity). After a few labels, you’ll see prob (calibrated match probability) and the queried pairs concentrate near p≈0.5.
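Concentrating queries near p≈0.5 is the behavior of uncertainty sampling. A minimal sketch of that idea, assuming a pool of (pair id, probability) tuples; the function name, pool format, and batch size are hypothetical and not the backend's actual API:

```python
def most_uncertain(pairs_with_prob, k=5):
    """Return the k pairs whose match probability is closest to 0.5."""
    return sorted(pairs_with_prob, key=lambda item: abs(item[1] - 0.5))[:k]


# Hypothetical pool of (pair_id, prob) tuples for illustration only.
pool = [
    ("a-b", 0.97),
    ("c-d", 0.52),
    ("e-f", 0.08),
    ("g-h", 0.41),
    ("i-j", 0.66),
]

# The two pairs the matcher is least sure about get queried first.
queried = most_uncertain(pool, k=2)
print(queried)
```

Pairs the matcher already scores near 0 or 1 are left for later, since a label on them would change the model very little.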

  1. Upload CSV(s)
  2. Label 10–30 pairs
  3. Run full scoring + download results

How it works

As labels accumulate, the backend can increase model capacity by using more principal components (PCs).
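One way to picture this is a schedule that allows more components as the label count grows and reports the share of variance they cover. The rule below (one component per five labels) and the eigenvalues are assumptions made up for illustration; the backend's actual rule is not documented here.

```python
def choose_components(eigenvalues, n_labels, labels_per_pc=5):
    """Pick how many principal components to use for a given label count.

    Assumed rule (not the backend's): allow one component per
    `labels_per_pc` labels, with at least one component and at most
    len(eigenvalues). `eigenvalues` must be sorted in descending order.
    """
    k = max(1, min(len(eigenvalues), n_labels // labels_per_pc))
    variance_explained = sum(eigenvalues[:k]) / sum(eigenvalues)
    return k, variance_explained


# Hypothetical eigenvalues of the feature covariance matrix.
eigs = [4.0, 2.0, 1.0, 0.5, 0.5]

for n_labels in (5, 10, 30):
    k, ve = choose_components(eigs, n_labels)
    print(f"{n_labels} labels: {k} PCs, {ve:.0%} variance explained")
```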

Hosted API demo

This page calls a hosted backend API. Uploads are processed for your session and used only to fit the in-session matcher (not for training outside your run). Data may be cached temporarily to generate outputs and is not intended to be retained as a dataset.

The backend is designed to run in a small, CPU-only compute footprint.

If you need a private deployment or a higher-throughput environment, email joe@josephsmiller.com.

Upload

The public demo may be rate-limited. If it is slow, email me and I can give you private access.

Sample preview

CSV format (hosted demo): upload one CSV for deduplication, or a left and a right CSV for linkage. Columns with matching names are treated as candidate matching evidence, and a text/name-like column is used as the display string.
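To check in advance which columns two files share, you can compare their header rows. This is a sketch that assumes exact, case-sensitive name matching; the backend may normalize names differently, and the sample data is invented.

```python
import csv
import io


def shared_columns(left_csv_text, right_csv_text):
    """Return header names present in both CSVs, in left-file order."""
    left_header = next(csv.reader(io.StringIO(left_csv_text)))
    right_header = set(next(csv.reader(io.StringIO(right_csv_text))))
    return [name for name in left_header if name in right_header]


# Invented sample files for illustration.
left = "id,name,city,phone\n1,Ann Lee,Boston,555-0100\n"
right = "ref,name,city,email\n9,Anne Lee,Boston,ann@example.com\n"

common = shared_columns(left, right)
print(common)
```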

Use standard CSV quoting for commas inside a field. If you are unsure about the format, download one of the sample CSVs first and match that structure.
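As an illustration of standard quoting, Python's built-in `csv` module wraps any field that contains a comma in double quotes when writing and restores it to a single field when reading. The rows are made-up sample data.

```python
import csv
import io

# Write a row whose fields contain commas; csv.writer quotes them.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "name", "address"])
writer.writerow([1, "Smith, John", "12 Main St, Apt 4"])
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back yields three fields per row, commas intact.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])
```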

Sample datasets are optional; you can upload your own CSVs instead (subject to demo limits enforced by the backend).

Label

Labels are suggested only for pairs with high-confidence probabilities. Suggested labels are highlighted until you review the pair; uncertain pairs are left blank.
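The suggestion rule can be pictured as two cutoffs on the match probability. The 0.9 and 0.1 thresholds below are assumed values for illustration; the demo does not state the cutoffs it uses.

```python
def suggest_label(prob, high=0.9, low=0.1):
    """Suggest a label only when the probability is confidently high or low.

    `high` and `low` are hypothetical cutoffs, not the backend's values.
    Returns "match", "no-match", or None (leave the pair blank).
    """
    if prob >= high:
        return "match"
    if prob <= low:
        return "no-match"
    return None


for p in (0.97, 0.55, 0.03):
    print(p, suggest_label(p))
```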

Score shown: -- • Pool remaining: -- • PCs used: -- • Variance explained: --

Compared fields: this hosted backend infers compatible fields from your CSVs and compares multiple columns field-by-field when they are available.

Keyboard shortcuts

j: next • k: previous • a: match • s: skip • d: no-match • x: clear • Enter: submit • Shift+N: next batch

Results

Predictions = scored candidate pairs (with probabilities once you’ve labeled a few pairs). Matches = linkage-mode pairs above the backend match threshold. Clusters are generated only for dedup mode (linkage mode returns matches instead).
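For dedup mode, one common way to turn above-threshold pairs into clusters is to take connected components with a union-find structure. This is a sketch of that general technique, not necessarily what the backend does; the 0.5 threshold and record ids are assumptions.

```python
def cluster_matches(scored_pairs, threshold=0.5):
    """Group records into clusters by linking pairs scored >= threshold.

    `scored_pairs` is an iterable of (record_a, record_b, prob) tuples.
    Records that never match anything come back as singleton clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, prob in scored_pairs:
        root_a, root_b = find(a), find(b)  # registers both records
        if prob >= threshold:
            parent[root_a] = root_b

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), []).append(record)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical predictions: r1, r2, r3 chain together; r4 and r5 do not match.
predictions = [("r1", "r2", 0.95), ("r2", "r3", 0.80), ("r4", "r5", 0.20)]
clusters = cluster_matches(predictions)
print(clusters)
```

Note that connected components are transitive: r1 and r3 end up in the same cluster even though that pair was never scored directly.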

Matches labeled: 0 • No-matches labeled: 0

Metrics update as you label batches. Use “Run full scoring” to generate downloadable outputs.

Questions, or want a private deployment / higher-throughput environment? Email me.