Skip to content

FTP: preserve filenames containing whitespace in _mlsd2#2043

Merged
martindurant merged 1 commit into
fsspec:masterfrom
mokashang:fix/ftp-mlsd2-filename-spaces
Jun 1, 2026
Merged

FTP: preserve filenames containing whitespace in _mlsd2#2043
martindurant merged 1 commit into
fsspec:masterfrom
mokashang:fix/ftp-mlsd2-filename-spaces

Conversation

@mokashang

Copy link
Copy Markdown
Contributor

Fixes #1970.

When the FTP server does not support MLSD, FTPFileSystem.ls() falls back to parsing the output of DIR, via _mlsd2. That parser used line.split() and took split_line[-1] as the filename, so any name containing whitespace was truncated to its last whitespace-separated token. The example in the issue:

-rw-r--r-- 1 owner group 38283 Jan 12 07:08 my file.pdf

would split into 10 fields and be returned as file.pdf instead of my file.pdf.

Fix

_mlsd2 now does line.split(maxsplit=8), which produces the first eight fixed fields (mode, link count, owner, group, size, month, day, time-or-year) plus the remainder of the line as a single string. The remainder is the filename, with any internal whitespace preserved as-is. This matches the approach suggested by the maintainer in the issue thread.

Symlink entries in ls -l output have the form <name> -> <target>, so for lines whose mode starts with l the trailing " -> <target>" is stripped to keep only the link's own name. (Previously the parser returned <target> as the entry name, since split_line[-1] picked the last token.)

Tests

Added four unit tests in fsspec/implementations/tests/test_ftp.py driving _mlsd2 with a small fake FTP object that replays a canned dir listing. This exercises the parser directly without needing an FTP server that disables MLSD:

  • test_mlsd2_parses_filename_without_spaces — sanity check that a plain filename still parses correctly.
  • test_mlsd2_parses_filename_with_spaces — the regression case, covering both files and directories, and also a filename with multiple consecutive internal spaces.
  • test_mlsd2_parses_symlink_name_without_target — verifies the link name (not the target) is returned for symlink entries.
  • test_mlsd2_skips_short_lines — confirms the existing skip for lines with fewer than 9 fields still discards the total N summary header some servers emit.

All four new tests pass. The full fsspec/tests/ and fsspec/implementations/tests/ suites continue to pass on this branch; the pre-existing test_ftp.py failures involving the pyftpdlib TLS test server fail identically on master on this machine and are unrelated to this change.

When the FTP server does not support MLSD, FTPFileSystem.ls() falls back to
parsing the output of DIR, which yields Linux-style `ls -l` lines. The
parser used `line.split()` and took `split_line[-1]` as the filename, so
any filename containing whitespace was truncated to its last token.

Switch to `line.split(maxsplit=8)` so the first eight fields are split out
and the remainder of the line is preserved as the filename, keeping any
internal whitespace intact. Symlink entries (`<name> -> <target>` from
`ls -l`) are also handled so the link's own name is returned rather than
the link target.

Adds unit tests that drive _mlsd2 with a fake FTP object so the parser can
be exercised without needing an FTP server that disables MLSD.

Fixes fsspec#1970
@martindurant

Copy link
Copy Markdown
Member

This one was bugging me, I'm glad you've come up with a simple solution.
However, FTP servers are notoriously inconsistent - are there yet more formatting oddities that might be foreseen?

@mokashang

Copy link
Copy Markdown
Contributor Author

Good question — I went back through the realistic variations of ls -l style listings that FTP servers actually emit, plus the non-Unix LIST formats, and ran each through the parser to see what survives. Sharing what I found, since I think it shapes how much further this PR should go.

Unix-style ls -l variations — all handled by split(maxsplit=8):

  • Single-digit day with a double space (Jan 3 07:08) — split() collapses whitespace runs into the field boundary, so this still parses correctly.
  • Year-in-place-of-time for old files (Jan 12 2024) — the date span is just three tokens stored verbatim, so it works identically.
  • ACL/SELinux suffix on the mode (-rw-r--r--. or -rw-r--r--+) — mode is stored as-is, no semantic parse.
  • Multiple internal spaces in a filename (a b c.txt) — preserved verbatim, because maxsplit=8 leaves the 9th chunk untouched.
  • Tabs inside a filename — same: preserved in the 9th chunk.
  • Trailing whitespace in a filename — also preserved, due to a Python quirk: when maxsplit is reached mid-content, the trailing chunk is not rstripped.
  • Filenames containing the literal substring " -> " on a regular (non-l...) entry — protected by the unix_mode[0] == "l" guard before we strip the arrow.

Unix-style cases that the parser still mis-handles (honestly):

  • Block / character device entries: crw-rw-rw- 1 root root 1, 3 Jan 12 07:08 null. The size field is split into major/minor (1, 3), which shifts every subsequent column by one. The current parser would report size="1,", modify="3 Jan 12", name="07:08 null". This is rare on real FTP exports (most servers don't surface device nodes), but it would silently produce a wrong row.
  • Symlinks where the link's own name contains " -> ": the .split(" -> ", 1) collapses too eagerly. Pathological enough that I think we can ignore it.

Non-Unix LIST formats (Windows IIS / NT-style, EPLF, VMS / MultiNet):

  • NT IIS format (01-15-25 03:45PM <DIR> dirname) produces 4–5 fields, so it falls into the existing < 9 skip and is silently dropped. Same for EPLF and the multi-line VMS format.
  • That's no worse than the pre-PR behavior — the old split_line[-1] would have returned the directory name for an NT row but with garbage for every other field, so callers were already broken there. The <9 skip arguably improves on that by returning no rows rather than wrong rows. Properly supporting these formats would require format-sniffing on the first line and dispatching to a per-format parser — a substantially larger change that I think belongs in its own PR (and is what pyftpdlib/ftputil do).

My take on the scope: I'd keep this PR focused on the original issue (filenames with whitespace on Unix-style servers, plus the symlink correctness drive-by that fell out of using maxsplit=8). Extending it to device-file rows or non-Unix LIST dialects would either gold-plate a rare case or invite a much bigger refactor. Happy to do either if you'd prefer — just say the word.

@martindurant

Copy link
Copy Markdown
Member

That information is very handy, and keeping it in this PR will allow others to find it too. I agree with your conclusions, and we can revisit if enough problematic scenarios emerge.

@martindurant martindurant merged commit 2818611 into fsspec:master Jun 1, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ls() method parses incorrectly when a filename contain a whitespace (FTP)

2 participants