pfftools: add Maildir export format with MIME synthesis and contact deduplication#158

Open
KJ7LNW wants to merge 4 commits into libyal:main from KJ7LNW:maildir

Conversation

@KJ7LNW KJ7LNW commented May 14, 2026

Description

libpff's pffexport tool previously supported only flat directory exports with no standard mail layout. This adds a -f maildir export mode producing RFC 2822 messages in a standard Maildir tree, consumable directly by mail user agents.

Type of Change

  • Feature (non-breaking change that adds functionality)
  • Bug fix (non-breaking change that fixes an issue)
  • Breaking change (fix or feature that causes existing functionality to change)
  • Refactor (no functional changes)
  • Documentation

Implementation Details

The Maildir exporter writes each email as a structurally valid RFC 2822 message into the cur/ subdirectory of a Maildir tree, creating cur/, new/, and tmp/ for each exported PST folder. A folder rule table skips synthetic containers (Common Views, Finder, NON_IPM_SUBTREE), passes through transparent wrappers (Root - Mailbox, IPM_SUBTREE) without creating a directory level, and renames others (Root - Public -> Public Folders).
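As a rough illustration of the folder rule table described above, here is a minimal sketch in C. The type names, action names, and the `folder_rule_lookup` function are hypothetical and are not taken from the PR; only the folder names and the skip/passthrough/rename behaviour come from the description. Exact, case-sensitive name matching is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical action codes; the PR's real identifiers may differ. */
typedef enum {
    FOLDER_ACTION_EXPORT,      /* no rule matched: keep the PST folder name */
    FOLDER_ACTION_SKIP,        /* synthetic container: not exported */
    FOLDER_ACTION_PASSTHROUGH, /* transparent wrapper: no directory level */
    FOLDER_ACTION_RENAME       /* exported under a different name */
} folder_action_t;

typedef struct {
    const char *pst_name;
    folder_action_t action;
    const char *new_name; /* only set for FOLDER_ACTION_RENAME */
} folder_rule_t;

static const folder_rule_t folder_rules[] = {
    { "Common Views",    FOLDER_ACTION_SKIP,        NULL },
    { "Finder",          FOLDER_ACTION_SKIP,        NULL },
    { "NON_IPM_SUBTREE", FOLDER_ACTION_SKIP,        NULL },
    { "Root - Mailbox",  FOLDER_ACTION_PASSTHROUGH, NULL },
    { "IPM_SUBTREE",     FOLDER_ACTION_PASSTHROUGH, NULL },
    { "Root - Public",   FOLDER_ACTION_RENAME,      "Public Folders" },
};

/* Looks up a PST folder name. *out_name receives the directory name to
 * create, or NULL when no directory should be created for this folder. */
folder_action_t folder_rule_lookup(const char *pst_name, const char **out_name)
{
    size_t i;

    assert(pst_name != NULL && out_name != NULL);

    for (i = 0; i < sizeof(folder_rules) / sizeof(folder_rules[0]); i++) {
        if (strcmp(pst_name, folder_rules[i].pst_name) == 0) {
            *out_name = folder_rules[i].new_name;
            return folder_rules[i].action;
        }
    }
    *out_name = pst_name; /* ordinary folder: exported unchanged */
    return FOLDER_ACTION_EXPORT;
}
```

Keeping the rules in a data table rather than in branching code means a new synthetic container can be skipped by adding one row.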

  • Cross-run deduplication: a djb2 hash table tracks seen Message-ID values. It is persisted to .seen_message_ids in the export root and reloaded on subsequent runs, enabling deduplication across multiple PST/OST archives without holding all IDs in memory between invocations.
  • MIME synthesis: both plain-text and HTML bodies are retrieved independently; attachments are enumerated and classified as inline or regular by Content-ID. The correct multipart/mixed, multipart/related, or multipart/alternative structure is selected from what is actually present. Original transport headers have MIME envelope lines stripped before writing so the synthesized headers are authoritative.
  • scripts/contact-to-vcf.py: reads Contact.txt files (as exported by libpff) from stdin, merges duplicates keyed on primary email or display name, and emits a vCard 3.0 .vcf file to stdout.
  • Manual page updated to document -f maildir, deduplication behavior, and the folder rule table.
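The MIME synthesis bullet above amounts to choosing an outermost structure from what the message actually contains. The following is a sketch of one such decision function, not the PR's code. All identifiers are hypothetical. The description does not say how a message with both bodies plus inline attachments is nested, nor what happens to inline attachments when there is no HTML body, so the choices made for those two cases here are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    MIME_SINGLE_PLAIN,
    MIME_SINGLE_HTML,
    MIME_MULTIPART_ALTERNATIVE,
    MIME_MULTIPART_RELATED,
    MIME_MULTIPART_MIXED
} mime_structure_t;

typedef struct {
    int has_plain_body;
    int has_html_body;
    size_t inline_count;  /* attachments that carry a Content-ID */
    size_t regular_count; /* all other attachments */
} mime_content_t;

/* Picks the outermost structure for a message. */
mime_structure_t mime_select_structure(const mime_content_t *content)
{
    assert(content != NULL);

    /* Regular attachments always force multipart/mixed. Assumption: inline
     * attachments without an HTML body to reference them are treated the
     * same way. */
    if (content->regular_count > 0
     || (content->inline_count > 0 && !content->has_html_body)) {
        return MIME_MULTIPART_MIXED;
    }
    /* HTML body plus inline attachments (assumed outermost even when a
     * plain-text body is also present). */
    if (content->inline_count > 0) {
        return MIME_MULTIPART_RELATED;
    }
    /* Both body types, no attachments. */
    if (content->has_plain_body && content->has_html_body) {
        return MIME_MULTIPART_ALTERNATIVE;
    }
    /* Single body: a direct Content-Type, no multipart wrapper. */
    return content->has_html_body ? MIME_SINGLE_HTML : MIME_SINGLE_PLAIN;
}

/* Maps the selected structure to its Content-Type value. */
const char *mime_structure_content_type(mime_structure_t structure)
{
    switch (structure) {
    case MIME_MULTIPART_MIXED:       return "multipart/mixed";
    case MIME_MULTIPART_RELATED:     return "multipart/related";
    case MIME_MULTIPART_ALTERNATIVE: return "multipart/alternative";
    case MIME_SINGLE_HTML:           return "text/html";
    case MIME_SINGLE_PLAIN:          return "text/plain";
    }
    return "text/plain";
}
```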

@joachimmetz

Member

Thanks for the proposed changes but please break these up in separate PRs with tests:

  • deduplication
  • maildir export
  • vcf conversion

Comment thread pfftools/export_handle.c Outdated
#define SEEN_IDS_BUCKETS 1024

/* Linked list node tracking a single seen Message-ID value */
typedef struct seen_message_id seen_message_id_t;

Member

move this into its own source/header file.

Comment thread pfftools/export_handle.c Outdated
return( -1 );
}

/* base64 character table */

@joachimmetz joachimmetz Jun 27, 2026

Member

Please use the base64 functions libuna provides

@joachimmetz joachimmetz added the "pending reporter input" label (Issue is pending input from the reporter) Jun 27, 2026
@joachimmetz joachimmetz self-assigned this Jun 27, 2026
Eric Wheeler added 4 commits June 28, 2026 20:41
Introduces a new -f maildir export mode that writes each email as an
RFC 2822 message into a Maildir tree (cur/, new/, tmp/ per folder),
producing output directly consumable by standard mail user agents.

To prevent duplicate messages when exporting multiple overlapping PST
or OST archives into the same Maildir tree, a djb2 hash table tracks
seen Message-ID values. The table is persisted to .seen_message_ids in
the export root and reloaded on subsequent runs, enabling cross-file
deduplication without holding all IDs in memory between invocations.
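For readers unfamiliar with djb2, the sketch below shows a chained hash set of Message-ID strings built on it. The bucket count matches the `SEEN_IDS_BUCKETS` value quoted in the review thread; the struct and function names are hypothetical and are not the PR's `seen_message_ids_table_t` interface.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SEEN_IDS_BUCKETS 1024

typedef struct seen_node {
    char *message_id;
    struct seen_node *next;
} seen_node_t;

typedef struct {
    seen_node_t *buckets[SEEN_IDS_BUCKETS];
} seen_table_t;

/* djb2: start at 5381, then hash = hash * 33 + c for every byte. */
unsigned long djb2_hash(const char *string)
{
    unsigned long hash = 5381;
    int c;

    while ((c = (unsigned char) *string++) != 0) {
        hash = ((hash << 5) + hash) + (unsigned long) c;
    }
    return hash;
}

/* Returns 1 if the Message-ID is already in the table, 0 otherwise. */
int seen_table_contains(const seen_table_t *table, const char *message_id)
{
    const seen_node_t *node;

    assert(table != NULL && message_id != NULL);

    node = table->buckets[djb2_hash(message_id) % SEEN_IDS_BUCKETS];
    for (; node != NULL; node = node->next) {
        if (strcmp(node->message_id, message_id) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Returns 1 if newly added, 0 if already present, -1 if out of memory. */
int seen_table_add(seen_table_t *table, const char *message_id)
{
    seen_node_t *node;
    size_t length, bucket;

    if (seen_table_contains(table, message_id)) {
        return 0;
    }
    length = strlen(message_id) + 1;
    node = malloc(sizeof *node);
    if (node == NULL) {
        return -1;
    }
    node->message_id = malloc(length);
    if (node->message_id == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->message_id, message_id, length);
    bucket = djb2_hash(message_id) % SEEN_IDS_BUCKETS;
    node->next = table->buckets[bucket]; /* push onto the bucket's chain */
    table->buckets[bucket] = node;
    return 1;
}

/* Releases every node and leaves the table empty. */
void seen_table_free(seen_table_t *table)
{
    size_t i;

    for (i = 0; i < SEEN_IDS_BUCKETS; i++) {
        while (table->buckets[i] != NULL) {
            seen_node_t *node = table->buckets[i];
            table->buckets[i] = node->next;
            free(node->message_id);
            free(node);
        }
    }
}
```

An exporter using such a table would skip a message when `seen_table_add` returns 0 and write it when the call returns 1.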

Maildir mode also applies a rule table to PST/OST internal folder names,
skipping synthetic containers (Common Views, Finder, NON_IPM_SUBTREE,
etc.), passing through transparent wrappers (Root - Mailbox,
IPM_SUBTREE) without creating a directory level, and renaming others
(Root - Public -> Public Folders). Non-email item types are silently
skipped so only RFC 2822-representable items appear in the output.

- add EXPORT_FORMAT_MAILDIR enum value and "maildir" input recognition
- add seen_message_ids_table_t hash table with load/save to .seen_message_ids
- add export_handle_initialize_maildir to build dedup path and load prior state
- add export_handle_export_email_maildir writing Maildir filenames to cur/
- add maildir_folder_rules table with skip/passthrough/rename actions
- create cur/, new/, tmp/ subdirectories per exported folder
- allow appending to existing export path in Maildir mode
- wire initialization and directory-exists bypass in pffexport main
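The load/save step for .seen_message_ids could look roughly like the sketch below. The on-disk format is not shown in this PR page, so one Message-ID per line is an assumption, as are all function names. The sketch also appends each ID as it is seen rather than rewriting the whole table, which may differ from what the PR does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Called once per Message-ID read from disk; return non-zero to abort. */
typedef int (*seen_id_callback_t)(const char *message_id, void *context);

/* Reads one Message-ID per line and hands each to on_id. Returns the number
 * of IDs loaded, 0 when the file cannot be opened (treated as a first run),
 * or -1 when the callback reports an error. Lines longer than the buffer
 * would be split, which a real implementation should handle. */
long seen_ids_load(const char *path, seen_id_callback_t on_id, void *context)
{
    char line[1024];
    long count = 0;
    FILE *file;

    assert(path != NULL && on_id != NULL);

    file = fopen(path, "r");
    if (file == NULL) {
        return 0;
    }
    while (fgets(line, sizeof line, file) != NULL) {
        line[strcspn(line, "\r\n")] = '\0'; /* strip the line terminator */
        if (line[0] == '\0') {
            continue; /* ignore blank lines */
        }
        if (on_id(line, context) != 0) {
            fclose(file);
            return -1;
        }
        count++;
    }
    fclose(file);
    return count;
}

/* Appends a single Message-ID to the file. Returns 0 on success, -1 on error. */
int seen_ids_append(const char *path, const char *message_id)
{
    FILE *file;
    int ok;

    assert(path != NULL && message_id != NULL);

    file = fopen(path, "a");
    if (file == NULL) {
        return -1;
    }
    ok = fprintf(file, "%s\n", message_id) > 0;
    if (fclose(file) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Example callback: counts the IDs it is given. A real exporter would
 * insert each ID into its in-memory hash table instead. */
int seen_ids_count(const char *message_id, void *context)
{
    (void) message_id;
    *(long *) context += 1;
    return 0;
}
```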

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>

The Maildir exporter previously wrote a single body part with no MIME
envelope, choosing plain-text or HTML by fallback rather than capturing
both.  Attachments were not included in the output at all, and the
original transport headers were written verbatim, leaving conflicting
Content-Type and MIME-Version fields.

This rework makes the exporter produce structurally valid RFC 2822
messages.  Both plain-text and HTML bodies are retrieved independently.
Attachments are enumerated and their Content-ID, MIME type, and filename
are read to classify each as inline or regular.  The correct multipart
structure is then synthesised from what is actually present:
multipart/mixed wraps the body section and regular attachments,
multipart/related wraps an HTML body with its inline attachments, and
multipart/alternative wraps both body types when no attachments are
present.  Original transport headers have their MIME envelope lines
stripped before writing so the synthesised headers are authoritative.
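Stripping the MIME envelope lines from the original transport headers can be sketched as an in-place filter over the header block. This is not the PR's `maildir_strip_mime_headers()`; the function names and the buffer-based interface are assumptions. The sketch also drops folded continuation lines (lines starting with a space or tab) that belong to a removed field, which is standard header folding but is not confirmed as the PR's behaviour.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test that a header line starts with "name:". */
static int header_is(const char *line, size_t length, const char *name)
{
    size_t name_length = strlen(name);
    size_t i;

    if (length <= name_length || line[name_length] != ':') {
        return 0;
    }
    for (i = 0; i < name_length; i++) {
        if (tolower((unsigned char) line[i]) != tolower((unsigned char) name[i])) {
            return 0;
        }
    }
    return 1;
}

/* Removes Content-Type, Content-Transfer-Encoding, and MIME-Version fields
 * from a NUL-terminated header block, in place. Returns the new length. */
size_t strip_mime_headers(char *headers)
{
    static const char *const dropped[] = {
        "Content-Type", "Content-Transfer-Encoding", "MIME-Version"
    };
    char *read_pos = headers;
    char *write_pos = headers;
    int dropping = 0;

    assert(headers != NULL);

    while (*read_pos != '\0') {
        const char *end_of_line = strchr(read_pos, '\n');
        size_t length = (end_of_line != NULL)
                      ? (size_t) (end_of_line - read_pos) + 1
                      : strlen(read_pos);
        size_t i;

        /* A line not starting with whitespace begins a new field;
         * continuation lines inherit the previous field's decision. */
        if (*read_pos != ' ' && *read_pos != '\t') {
            dropping = 0;
            for (i = 0; i < sizeof(dropped) / sizeof(dropped[0]); i++) {
                if (header_is(read_pos, length, dropped[i])) {
                    dropping = 1;
                    break;
                }
            }
        }
        if (!dropping) {
            memmove(write_pos, read_pos, length);
            write_pos += length;
        }
        read_pos += length;
    }
    *write_pos = '\0';
    return (size_t) (write_pos - headers);
}
```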

- add PR_ATTACH_MIME_TAG and PR_ATTACH_CONTENT_ID defines for MAPI
  properties absent from the shared entry-type enum
- add mime_base64_write_attachment() to stream attachment data as
  base64 with CRLF-terminated 76-character lines
- add maildir_strip_mime_headers() to remove Content-Type,
  Content-Transfer-Encoding, and MIME-Version from original headers
- retrieve plain-text and HTML bodies independently, removing the
  plain-text-or-HTML fallback
- enumerate attachments to collect content-id, MIME type, and filename,
  classifying each as inline or regular before writing begins
- replace flat body write with a MIME structure decision tree that
  selects multipart/mixed, multipart/related, multipart/alternative, or
  a direct Content-Type based on available content
- move file close, dedup table insert, and success log after all parts
  are written; extend on_error cleanup to cover attachment metadata
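The base64 output format named in the list above (76-character lines, each terminated by CRLF) can be illustrated with a small stand-alone encoder. This is a hand-rolled sketch for illustration only: it is not the PR's `mime_base64_write_attachment()`, it does not use libuna, and the function name and `FILE *` interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define B64_LINE_LENGTH 76 /* 19 groups of 4 output characters */

/* Writes data as base64, wrapping every 76 characters with CRLF.
 * Returns 0 on success, -1 on a write error. */
int base64_write_wrapped(FILE *out, const unsigned char *data, size_t size)
{
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t i;
    size_t column = 0;

    assert(out != NULL && (data != NULL || size == 0));

    for (i = 0; i < size; i += 3) {
        size_t remaining = size - i;
        unsigned long triplet = (unsigned long) data[i] << 16;
        char quad[4];

        if (remaining > 1) {
            triplet |= (unsigned long) data[i + 1] << 8;
        }
        if (remaining > 2) {
            triplet |= (unsigned long) data[i + 2];
        }
        /* Split 24 bits into four 6-bit indexes; pad a short final group. */
        quad[0] = table[(triplet >> 18) & 0x3f];
        quad[1] = table[(triplet >> 12) & 0x3f];
        quad[2] = (remaining > 1) ? table[(triplet >> 6) & 0x3f] : '=';
        quad[3] = (remaining > 2) ? table[triplet & 0x3f] : '=';

        if (fwrite(quad, 1, 4, out) != 4) {
            return -1;
        }
        column += 4;
        if (column == B64_LINE_LENGTH) {
            if (fputs("\r\n", out) == EOF) {
                return -1;
            }
            column = 0;
        }
    }
    /* Terminate a partial last line. */
    if (column > 0 && fputs("\r\n", out) == EOF) {
        return -1;
    }
    return 0;
}
```

Because 76 is a multiple of 4, a line break never falls inside a four-character group, so 57 input bytes fill exactly one output line.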

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>

Documents the -f maildir option added in the Maildir export
commits: RFC 2822 output layout, cross-run Message-ID
deduplication via .seen_message_ids, and the folder
skip/passthrough/rename rule table.

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>

The deduplication hash table was embedded as static functions in
export_handle.c alongside unrelated MIME logic.  Moving it to its own
translation unit separates concerns and makes the interface available
without exposing internals.

- move seen_message_ids_table type definition, hash, free, add,
  contains, load, and save from export_handle.c to seen_message_ids.c,
  promoting static functions to extern so the header can declare them
- add seen_message_ids.h with the opaque typedef and function prototypes
- include seen_message_ids.h in export_handle.h; drop the now-redundant
  forward typedef from export_handle.h
- add seen_message_ids.c and seen_message_ids.h to pffexport_SOURCES in
  Makefile.am
- replace hand-rolled base64 loop in mime_base64_write_attachment with
  libuna_base64_triplet_copy_from_byte_stream and
  libuna_base64_triplet_copy_to_base64_stream, removing the local
  mime_b64_chars table and its manual bit-twiddling

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
@KJ7LNW

KJ7LNW commented Jun 29, 2026

Author

I pushed the minor extraction for you. I think that's consistent, but I no longer have the tooling to test this: the system I was working on and the dataset I was using for validation are no longer available to me.

As such, I hesitate to perform any other changes beyond these because I can't validate them. If you would like to accept them, then great, maybe it will help others. If not, then I understand.

Additionally, I created a separate pull request for the VCF card conversion in #161.

@joachimmetz

Member

Thanks for the changes and additional context. Some level of testing would be necessary to catch regressions, particularly given the potential for loss or alteration of data when converting to other formats. When time permits, I'll take a look at what I can salvage.

@joachimmetz joachimmetz removed the "pending reporter input" label (Issue is pending input from the reporter) Jun 29, 2026