pfftools: add Maildir export format with MIME synthesis and contact deduplication #158
KJ7LNW wants to merge 4 commits
2db2e0b to 9c5994a
Thanks for the proposed changes, but please break these up into separate PRs with tests:
    #define SEEN_IDS_BUCKETS 1024

    /* Linked list node tracking a single seen Message-ID value */
    typedef struct seen_message_id seen_message_id_t;

Move this into its own source/header file.
        return( -1 );
    }

    /* base64 character table */

Please use the base64 functions libuna provides.
Introduces a new -f maildir export mode that writes each email as an RFC 2822 message into a Maildir tree (cur/, new/, tmp/ per folder), producing output directly consumable by standard mail user agents.

To prevent duplicate messages when exporting multiple overlapping PST or OST archives into the same Maildir tree, a djb2 hash table tracks seen Message-ID values. The table is persisted to .seen_message_ids in the export root and reloaded on subsequent runs, enabling cross-file deduplication without holding all IDs in memory between invocations.

Maildir mode also applies a rule table to PST/OST internal folder names, skipping synthetic containers (Common Views, Finder, NON_IPM_SUBTREE, etc.), passing through transparent wrappers (Root - Mailbox, IPM_SUBTREE) without creating a directory level, and renaming others (Root - Public -> Public Folders). Non-email item types are silently skipped so only RFC 2822-representable items appear in the output.

- add EXPORT_FORMAT_MAILDIR enum value and "maildir" input recognition
- add seen_message_ids_table_t hash table with load/save to .seen_message_ids
- add export_handle_initialize_maildir to build dedup path and load prior state
- add export_handle_export_email_maildir writing Maildir filenames to cur/
- add maildir_folder_rules table with skip/passthrough/rename actions
- create cur/, new/, tmp/ subdirectories per exported folder
- allow appending to existing export path in Maildir mode
- wire initialization and directory-exists bypass in pffexport main

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
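The Message-ID deduplication described in this commit message can be illustrated with a small chained hash table. This is a minimal sketch, not the code from the PR: only the bucket count (SEEN_IDS_BUCKETS 1024) and the use of djb2 come from the conversation, while the struct layout and the names seen_node_t, seen_table_t, djb2_hash, seen_table_contains and seen_table_add are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SEEN_IDS_BUCKETS 1024

/* One chained entry per distinct Message-ID (hypothetical layout) */
typedef struct seen_node
{
	char *message_id;
	struct seen_node *next;
} seen_node_t;

/* Zero-initialize before use: every bucket starts as an empty chain */
typedef struct seen_table
{
	seen_node_t *buckets[ SEEN_IDS_BUCKETS ];
} seen_table_t;

/* djb2: hash = hash * 33 + c, seeded with 5381 */
uint32_t djb2_hash( const char *string )
{
	uint32_t hash = 5381;

	while( *string != 0 )
	{
		hash = ( ( hash << 5 ) + hash ) + (unsigned char) *string;
		string++;
	}
	return( hash );
}

/* Returns 1 if the Message-ID was seen before, 0 otherwise */
int seen_table_contains( const seen_table_t *table, const char *message_id )
{
	const seen_node_t *node = table->buckets[ djb2_hash( message_id ) % SEEN_IDS_BUCKETS ];

	while( node != NULL )
	{
		if( strcmp( node->message_id, message_id ) == 0 )
		{
			return( 1 );
		}
		node = node->next;
	}
	return( 0 );
}

/* Returns 1 if added, 0 if already present, -1 on allocation failure */
int seen_table_add( seen_table_t *table, const char *message_id )
{
	seen_node_t *node = NULL;
	size_t length     = strlen( message_id ) + 1;
	uint32_t bucket   = djb2_hash( message_id ) % SEEN_IDS_BUCKETS;

	if( seen_table_contains( table, message_id ) != 0 )
	{
		return( 0 );
	}
	node = malloc( sizeof( seen_node_t ) );

	if( node == NULL )
	{
		return( -1 );
	}
	node->message_id = malloc( length );

	if( node->message_id == NULL )
	{
		free( node );
		return( -1 );
	}
	memcpy( node->message_id, message_id, length );

	/* Push onto the front of the bucket's chain */
	node->next               = table->buckets[ bucket ];
	table->buckets[ bucket ] = node;

	return( 1 );
}
```

In an exporter built this way, a message would be written only when seen_table_add returns 1. Persisting the table would then amount to writing one Message-ID per line to .seen_message_ids and replaying those lines through seen_table_add at startup; that file layout is also an assumption, since the conversation does not show it.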
The Maildir exporter previously wrote a single body part with no MIME envelope, choosing plain-text or HTML by fallback rather than capturing both. Attachments were not included in the output at all, and the original transport headers were written verbatim, leaving conflicting Content-Type and MIME-Version fields.

This rework makes the exporter produce structurally valid RFC 2822 messages. Both plain-text and HTML bodies are retrieved independently. Attachments are enumerated and their Content-ID, MIME type, and filename are read to classify each as inline or regular. The correct multipart structure is then synthesised from what is actually present: multipart/mixed wraps the body section and regular attachments, multipart/related wraps an HTML body with its inline attachments, and multipart/alternative wraps both body types when no attachments are present. Original transport headers have their MIME envelope lines stripped before writing so the synthesised headers are authoritative.

- add PR_ATTACH_MIME_TAG and PR_ATTACH_CONTENT_ID defines for MAPI properties absent from the shared entry-type enum
- add mime_base64_write_attachment() to stream attachment data as base64 with CRLF-terminated 76-character lines
- add maildir_strip_mime_headers() to remove Content-Type, Content-Transfer-Encoding, and MIME-Version from original headers
- retrieve plain-text and HTML bodies independently, removing the plain-text-or-HTML fallback
- enumerate attachments to collect content-id, MIME type, and filename, classifying each as inline or regular before writing begins
- replace flat body write with a MIME structure decision tree that selects multipart/mixed, multipart/related, multipart/alternative, or a direct Content-Type based on available content
- move file close, dedup table insert, and success log after all parts are written; extend on_error cleanup to cover attachment metadata

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
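The MIME structure decision tree described in this commit message could look roughly like the sketch below, which picks only the top-level Content-Type. The type mime_inputs_t and the function mime_top_level_type are hypothetical names, and the precedence between the cases is one reading of the commit message rather than the PR's actual logic. In particular, treating inline attachments as regular ones when there is no HTML body, and falling back to text/plain when no body exists, are assumptions.

```c
#include <stddef.h>

/* What the exporter found in a message (hypothetical summary struct) */
typedef struct mime_inputs
{
	int has_plain;        /* non-zero if a plain-text body exists */
	int has_html;         /* non-zero if an HTML body exists */
	size_t inline_count;  /* attachments carrying a Content-ID */
	size_t regular_count; /* all other attachments */
} mime_inputs_t;

/* Selects the top-level Content-Type from what is actually present */
const char *mime_top_level_type( const mime_inputs_t *inputs )
{
	size_t regular_count = inputs->regular_count;

	/* Assumption: inline parts without an HTML body to reference them
	 * are exported as regular attachments
	 */
	if( inputs->has_html == 0 )
	{
		regular_count += inputs->inline_count;
	}
	if( regular_count > 0 )
	{
		/* Body section followed by regular attachments */
		return( "multipart/mixed" );
	}
	if( ( inputs->has_html != 0 ) && ( inputs->inline_count > 0 ) )
	{
		/* HTML body together with its inline attachments */
		return( "multipart/related" );
	}
	if( ( inputs->has_plain != 0 ) && ( inputs->has_html != 0 ) )
	{
		/* Both body types, no attachments */
		return( "multipart/alternative" );
	}
	if( inputs->has_html != 0 )
	{
		return( "text/html" );
	}
	/* Assumption: plain text is the default when no HTML body exists */
	return( "text/plain" );
}
```

A full writer would apply the same kind of choice recursively, for example nesting a multipart/related section inside multipart/mixed when both inline and regular attachments are present.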
Documents the -f maildir option added in the Maildir export commits: RFC 2822 output layout, cross-run Message-ID deduplication via .seen_message_ids, and the folder skip/passthrough/rename rule table.

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
The deduplication hash table was embedded as static functions in export_handle.c alongside unrelated MIME logic. Moving it to its own translation unit separates concerns and makes the interface available without exposing internals.

- move seen_message_ids_table type definition, hash, free, add, contains, load, and save from export_handle.c to seen_message_ids.c, promoting static functions to extern so the header can declare them
- add seen_message_ids.h with the opaque typedef and function prototypes
- include seen_message_ids.h in export_handle.h; drop the now-redundant forward typedef from export_handle.h
- add seen_message_ids.c and seen_message_ids.h to pffexport_SOURCES in Makefile.am
- replace hand-rolled base64 loop in mime_base64_write_attachment with libuna_base64_triplet_copy_from_byte_stream and libuna_base64_triplet_copy_to_base64_stream, removing the local mime_b64_chars table and its manual bit-twiddling

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
I pushed the minor extraction for you. I think that's consistent, but I no longer have the tooling to test this. The system I was working on and the dataset that I was using for validation are no longer available to me. As such, I hesitate to make any other changes beyond these because I can't validate them. If you would like to accept them, then great, maybe it will help others. If not, then I understand. Additionally, I created a separate pull request for the VCF card conversion in #161.
Thanks for the changes and additional context. Some level of testing would be necessary to catch regressions. When time permits, I'll take a look at what I can salvage. Also given potential loss or alteration of data when converting to other formats
Description
libpff's pffexport tool previously supported only flat directory exports with no standard mail layout. This adds a -f maildir export mode producing RFC 2822 messages in a standard Maildir tree, consumable directly by mail user agents.
Type of Change
Implementation Details
The Maildir exporter writes each email as a structurally valid RFC 2822 message into a Maildir tree with cur/, new/, and tmp/ subdirectories per PST folder.

- A folder rule table skips synthetic containers (Common Views, Finder, NON_IPM_SUBTREE), passes through transparent wrappers (Root - Mailbox, IPM_SUBTREE) without creating a directory level, and renames others (Root - Public -> Public Folders).
- A djb2 hash table tracks seen Message-ID values, persisted to .seen_message_ids in the export root and reloaded on subsequent runs, enabling deduplication across multiple PST/OST archives without holding all IDs in memory between invocations.
- The correct multipart/mixed, multipart/related, or multipart/alternative structure is selected from what is actually present. Original transport headers have MIME envelope lines stripped before writing so the synthesized headers are authoritative.
- scripts/contact-to-vcf.py: reads Contact.txt files exported by libpff from stdin, merges duplicates keyed on primary email or display name, and emits a vCard 3.0 .vcf file to stdout.
- Documentation covers -f maildir, deduplication behavior, and the folder rule table.
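The folder rule table described above could be expressed as a static lookup table. The sketch below is illustrative only: the folder names and the skip/passthrough/rename actions come from the description, while the enum, struct and function names are hypothetical and exact, case-sensitive matching is an assumption. The PR's own maildir_folder_rules table may differ.

```c
#include <stddef.h>
#include <string.h>

typedef enum folder_action
{
	FOLDER_ACTION_EXPORT,      /* no rule matched: export under its own name */
	FOLDER_ACTION_SKIP,        /* synthetic container: skipped */
	FOLDER_ACTION_PASSTHROUGH, /* transparent wrapper: no directory level created */
	FOLDER_ACTION_RENAME       /* export under a different name */
} folder_action_t;

typedef struct folder_rule
{
	const char *name;
	folder_action_t action;
	const char *replacement; /* only used by FOLDER_ACTION_RENAME */
} folder_rule_t;

/* Names taken from the PR description; the list there is not exhaustive */
static const folder_rule_t folder_rules[] = {
	{ "Common Views",    FOLDER_ACTION_SKIP,        NULL },
	{ "Finder",          FOLDER_ACTION_SKIP,        NULL },
	{ "NON_IPM_SUBTREE", FOLDER_ACTION_SKIP,        NULL },
	{ "Root - Mailbox",  FOLDER_ACTION_PASSTHROUGH, NULL },
	{ "IPM_SUBTREE",     FOLDER_ACTION_PASSTHROUGH, NULL },
	{ "Root - Public",   FOLDER_ACTION_RENAME,      "Public Folders" }
};

/* Returns the action for folder_name and sets *output_name to the
 * directory name to create, or NULL when no directory should be created
 */
folder_action_t folder_rule_lookup( const char *folder_name, const char **output_name )
{
	size_t rule_index = 0;
	size_t rule_count = sizeof( folder_rules ) / sizeof( folder_rules[ 0 ] );

	for( rule_index = 0; rule_index < rule_count; rule_index++ )
	{
		if( strcmp( folder_name, folder_rules[ rule_index ].name ) == 0 )
		{
			/* replacement is NULL for skip and passthrough rules */
			*output_name = folder_rules[ rule_index ].replacement;

			return( folder_rules[ rule_index ].action );
		}
	}
	*output_name = folder_name;

	return( FOLDER_ACTION_EXPORT );
}
```

Keeping the rules in a data table rather than a chain of string comparisons means a new folder name can be handled by adding one row.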