Enhanced dedup #259

Merged
TNRiley merged 8 commits into dev from enhanced-dedup on Jun 1, 2026

@TNRiley
Member

@TNRiley TNRiley commented Jun 1, 2026

See the 0.2.1 section of NEWS.md.

TNRiley added 6 commits June 1, 2026 10:05
Enable an auto-dedup-now, review-later workflow so users can pause after
automatic deduplication, export, and complete manual review on re-import.

Package:
- export_dedup_candidates() / reimport_dedup_candidates(): persist and
  restore the $manual_dedup candidate pairs across an export/reimport
  boundary (IDs kept as character for re-merging).
- export_csv() gains manual_dedup_complete flag (written as a column on
  full exports; read back by reimport_csv()) as a UX guard.
- reimport_csv() now reads all columns as character, matching the
  canonical all-character types from dedup_citations(). Required so a
  reimported set can re-enter dedup_citations_add_manual() without
  column-type clashes (read.csv otherwise infers integer ids/years).
- Tests in test-reimport.R cover the round-trip and merge.
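
The column-type clash described above can be reproduced with base R alone. The sketch below uses a made-up two-row data frame (the column names are illustrative, not the package schema) and contrasts default `read.csv()` type inference with an all-character read via `colClasses = "character"`. Whether `reimport_csv()` uses exactly this argument internally is an assumption.

```r
# Illustrative data only; these column names are assumptions, not the package schema.
original <- data.frame(
  duplicate_id = c("1", "2"),
  year = c("2019", "2021"),
  title = c("First study", "Second study"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")
write.csv(original, path, row.names = FALSE)

# Default read: numeric-looking columns are inferred as integer.
inferred <- read.csv(path, stringsAsFactors = FALSE)

# All-character read: every column comes back as character, as it was before export.
as_char <- read.csv(path, colClasses = "character")

print(sapply(inferred, class))
print(sapply(as_char, class))
unlink(path)
```

Binding `inferred` to an all-character data frame with `dplyr::bind_rows()` would fail on the mismatched column types, which is the clash the all-character read avoids.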

Shiny app:
- file_reimport observer now handles multiple files and routes by
  content (candidate-pairs CSV vs deduplicated citation set vs RIS),
  fixing a latent length>1 condition error on multi-file selection.
- Restoring candidate pairs repopulates the Manual deduplication tab on a
  reimported set; the result column is dropped so merges follow the
  user's row selection.
- Export tab: Candidate Pairs (CSV) download; CSV export sets the
  manual_dedup_complete flag based on whether pairs remain pending.
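
The length>1 condition error mentioned above occurs when a vector is used directly as an `if ()` condition, which has been an error since R 4.2.0. The sketch below shows the failure mode and a per-file alternative. The file names and the `route_file()` helper are hypothetical, and routing here is by extension only for brevity, whereas the app routes by content.

```r
# Hypothetical file names standing in for a multi-file selection.
files <- c("citations.csv", "candidate_pairs.csv", "extra.ris")
ext <- tolower(tools::file_ext(files))

# A vector condition in if () fails on R >= 4.2.0 once several files are selected.
vector_if <- tryCatch(
  if (ext == "ris") "ris" else "csv",
  error = function(e) "error"
)

# Route each file on its own, so every condition has length one.
route_file <- function(path) {
  if (tolower(tools::file_ext(path)) == "ris") "ris" else "csv"
}
routes <- vapply(files, route_file, character(1), USE.NAMES = FALSE)
print(routes)
```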
dedup_citations_add_sources(existing, new_raw) adds new raw citations to a
previously deduplicated set and re-deduplicates across both, preserving
prior auto/manual merge decisions and the original record_ids provenance.
For the same data it produces the same unique set as deduplicating
everything from scratch (validated on the gambling-harms vignette data:
163 existing + 431 new -> 278 unique, == from-scratch; 645 underlying
record_ids preserved).

Implementation reconciles IDs (existing duplicate_id -> record_id; new
records get fresh non-colliding ids based on the max underlying id), drops
duplicate_id/record_ids so the engine's format_rerun rename can't clash on
record_id, re-runs dedup_citations(), then expands record_ids back to the
original underlying IDs via a provenance lookup. Works in manual = TRUE
mode to surface new candidate pairs.
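
The final step, expanding record_ids back to the original underlying IDs, can be sketched in base R. This is an illustration under stated assumptions, not the code in R/dedup.R: the `lookup` vector, the `expand_ids()` helper, and the ", " separator are all hypothetical.

```r
# Hypothetical provenance lookup: id used during the re-run -> original underlying ids.
# Existing record "1" was itself a merge of original records 1 and 2.
lookup <- c("1" = "1, 2", "3" = "3", "4" = "4")

# Split a comma-separated id string into trimmed tokens.
split_ids <- function(x) trimws(unlist(strsplit(x, ",", fixed = TRUE)))

# Replace each id with its original ids; ids missing from the lookup are kept as-is.
expand_ids <- function(ids, lookup) {
  parts <- split_ids(ids)
  expanded <- ifelse(parts %in% names(lookup), unname(lookup[parts]), parts)
  paste(unique(split_ids(expanded)), collapse = ", ")
}

# record_ids as they might look straight after the re-run.
rerun_record_ids <- c("1, 4", "3")
expanded <- vapply(rerun_record_ids, expand_ids, character(1),
                   lookup = lookup, USE.NAMES = FALSE)
print(expanded)
```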

Shiny app:
- file_reimport sets rv$existing_dedup_present when a deduplicated set is
  re-imported.
- identify_dups: with an existing re-imported set present, "Find
  duplicates" merges new uploads in via dedup_citations_add_sources();
  otherwise deduplicates the uploads as before. Uploads (and the upload
  form) are cleared after a merge to prevent adding the same records twice.
- Deduplicate tab hint describes the add-sources flow.

Tests in test-add-sources.R.

On the File upload tab, re-importing a previously deduplicated/exported set
now renders a view-only summary card listing per-source (and label/string)
record counts, so users can see what is already in the set before adding
more references. Tokens are de-duplicated within each record, so each unique
record counts once per distinct source/label/string. The card also notes the
total record count and whether manual deduplication was marked complete. It
is kept separate from the new-uploads metadata form and does not allow
editing source/label/string.
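
The counting rule (each record counts once per distinct source) can be illustrated as follows. The `cite_source` column name, the comma separator, and the sample values are assumptions made for this sketch, not taken from the app code.

```r
# Made-up records; the first one lists "WoS" twice.
records <- data.frame(
  cite_source = c("WoS, Scopus, WoS", "Scopus", "PubMed, WoS"),
  stringsAsFactors = FALSE
)

# Split each record's sources and de-duplicate tokens within the record.
tokens <- lapply(
  strsplit(records$cite_source, ",", fixed = TRUE),
  function(x) unique(trimws(x))
)

# Count how many records mention each source.
counts <- table(unlist(tokens))
print(counts)
```

Without the within-record `unique()`, the first record would count twice toward WoS.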
…ad page

User Guide (in-app www/user_guide.md):
- Step 1: re-import section rewritten — re-importing is no longer a dead end;
  describes the read-only source-overview card and the three paths (continue to
  analysis, add new sources, or finish manual review by re-uploading candidate
  pairs). Adds a "growing a review over time" note.
- Step 2: note that Find duplicates merges new uploads into a re-imported set.
- Step 3: how to pause and finish manual review later via candidate-pairs export.
- Step 6: document Dedup Log and Candidate Pairs downloads and the
  manual-dedup-complete flag in the full CSV.

File upload page (app.R sidebar): clearer labels and helper text distinguishing
"upload new files to deduplicate" from "re-upload a CiteSource export" (and that
the two can be combined to add sources or resume manual review).

README: note incremental add-sources and deferred manual review on re-import.

- DESCRIPTION: Version 0.2.0 -> 0.2.1, Date 2026-06-01.
- NEWS.md: add 0.2.1 section covering incremental deduplication
  (dedup_citations_add_sources), deferred manual deduplication
  (export/reimport_dedup_candidates + manual_dedup_complete flag), the Shiny
  re-import source overview and multi-file content-routed re-upload, the
  all-character reimport_csv fix, and the doc updates. These were moved out of
  the released 0.2.0 section.
- CITATION.cff: version 0.2.1, date-released 2026-06-01.

Copilot AI left a comment

Pull request overview

This PR introduces incremental deduplication and a deferred manual-review workflow, allowing users to (1) add new sources to an already-deduplicated set and (2) export/import manual candidate pairs to finish review later. It updates the R API, adds tests, and wires the workflows into the Shiny app plus documentation/release metadata.

Changes:

  • Add dedup_citations_add_sources() for incremental deduplication while preserving record_ids provenance.
  • Add candidate-pair export/import helpers (export_dedup_candidates() / reimport_dedup_candidates()) and make reimport_csv() read all columns as character for type-stable round-trips.
  • Update Shiny app UX and exports to support re-uploading deduped sets, restoring candidate pairs, and exporting a manual_dedup_complete flag.

Reviewed changes

Copilot reviewed 12 out of 16 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| R/dedup.R | Adds dedup_citations_add_sources() incremental dedup + provenance restoration. |
| R/export.R | Adds manual_dedup_complete export flag + export_dedup_candidates(). |
| R/reimport.R | Forces all-character CSV reimport + adds reimport_dedup_candidates(). |
| inst/shiny-app/CiteSource/app.R | Supports multi-file re-import routing (dedup set vs candidate pairs), merge-new-sources flow, and candidate-pair download. |
| inst/shiny-app/CiteSource/www/user_guide.md | Documents incremental + deferred manual-review workflows in the app. |
| README.md | Documents incremental + deferred manual-review workflows in the package README. |
| tests/testthat/test-add-sources.R | Adds automated tests for incremental deduplication behavior. |
| tests/testthat/test-reimport.R | Adds round-trip tests for CSV reimport types, flags, and candidate-pair workflows. |
| NAMESPACE | Exports newly added public functions. |
| man/dedup_citations_add_sources.Rd | Generated docs for dedup_citations_add_sources(). |
| man/export_csv.Rd | Generated docs for new export_csv() parameter. |
| man/export_dedup_candidates.Rd | Generated docs for export_dedup_candidates(). |
| man/reimport_dedup_candidates.Rd | Generated docs for reimport_dedup_candidates(). |
| NEWS.md | Adds 0.2.1 release notes covering new workflows and fixes. |
| DESCRIPTION | Bumps package version/date to 0.2.1. |
| CITATION.cff | Bumps citation version/date-released to 0.2.1. |

Files not reviewed (4)
  • man/dedup_citations_add_sources.Rd: Language not supported
  • man/export_csv.Rd: Language not supported
  • man/export_dedup_candidates.Rd: Language not supported
  • man/reimport_dedup_candidates.Rd: Language not supported

Comment thread R/dedup.R
Comment on lines +245 to +253
max_id <- suppressWarnings(max(as.numeric(existing_ids), na.rm = TRUE))

nw <- dplyr::mutate(new_citations, dplyr::across(dplyr::everything(), as.character))
nw <- dplyr::select(nw, -dplyr::any_of(c("duplicate_id", "record_ids", "record_id")))
nw$record_id <- if (is.finite(max_id)) {
  as.character(max_id + seq_len(nrow(nw)))
} else {
  paste0("new_", seq_len(nrow(nw)))
}
Member Author

record_ids would never be non-numeric
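
For context on the branch under discussion, the snippet's ID allocation can be exercised in isolation. The `next_ids()` wrapper below is a hypothetical helper written for this illustration; it reuses the same `max()` / `is.finite()` logic to show when the `new_` fallback would trigger.

```r
# Hypothetical wrapper around the allocation logic from the reviewed lines.
next_ids <- function(existing_ids, n) {
  # Non-numeric ids become NA; with nothing numeric left, max() returns -Inf.
  max_id <- suppressWarnings(max(as.numeric(existing_ids), na.rm = TRUE))
  if (is.finite(max_id)) {
    as.character(max_id + seq_len(n))
  } else {
    paste0("new_", seq_len(n))
  }
}

print(next_ids(c("1", "7", "3"), 2))  # numeric ids: continue after the maximum
print(next_ids(c("a", "b"), 2))       # no numeric ids: fallback labels
```

If record_ids are always numeric, as stated above, the fallback branch is never reached for a non-empty existing set.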

Comment thread R/dedup.R Outdated
TNRiley and others added 2 commits June 1, 2026 11:47
R CMD check --as-cran is now 0 errors | 0 warnings | 1 note (the note is the
expected "New submission" feasibility notice plus transient URL-check resets).

- R/dedup.R: replace non-ASCII em-dashes (incl. one in a stop() string) with
  ASCII hyphens; regenerate affected .Rd files. Clears the "non-ASCII
  characters in R code" WARNING.
- .Rbuildignore: exclude CLAUDE.md, guide/, and .tmp* so dev/session files are
  not bundled into the build tarball. Clears the "non-standard top-level files"
  NOTE.
- cran-comments.md: rewritten for the 0.2.1 feature update.
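
A quick way to find lines like the ones fixed in R/dedup.R is to test whether each line converts to ASCII. The sketch below uses two made-up lines of R source; it is one possible check, not necessarily how the offending characters were located here. `tools::showNonASCIIfile()` is a ready-made alternative for whole files.

```r
# Made-up source lines; the second contains an em-dash written as an escape.
lines <- c('x <- 1', 'stop("bad input \u2014 aborting")')

# iconv() returns NA for elements that cannot be represented in ASCII.
has_non_ascii <- is.na(iconv(lines, from = "UTF-8", to = "ASCII"))

# Replace the em-dash with an ASCII hyphen, as done in the commit.
cleaned <- gsub("\u2014", "-", lines, fixed = TRUE)
print(has_non_ascii)
print(cleaned)
```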

Note: networkD3 remains in Suggests but is unused anywhere in the package
(harmless on CRAN; flagged for optional removal).
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@TNRiley TNRiley merged commit 6c98cf8 into dev Jun 1, 2026
2 checks passed
@TNRiley TNRiley deleted the enhanced-dedup branch June 1, 2026 18:01

2 participants