FTP: preserve filenames containing whitespace in _mlsd2#2043
Conversation
When the FTP server does not support MLSD, FTPFileSystem.ls() falls back to parsing the output of DIR, which yields Linux-style `ls -l` lines. The parser used `line.split()` and took `split_line[-1]` as the filename, so any filename containing whitespace was truncated to its last token. Switch to `line.split(maxsplit=8)` so the first eight fields are split out and the remainder of the line is preserved as the filename, keeping any internal whitespace intact. Symlink entries (`<name> -> <target>` from `ls -l`) are also handled so the link's own name is returned rather than the link target. Adds unit tests that drive _mlsd2 with a fake FTP object so the parser can be exercised without needing an FTP server that disables MLSD. Fixes fsspec#1970
|
This one was bugging me, I'm glad you've come up with a simple solution. |
|
Good question — I went back through the realistic variations of Unix-style
Unix-style cases that the parser still mis-handles (honestly):
Non-Unix
My take on the scope: I'd keep this PR focused on the original issue (filenames with whitespace on Unix-style servers, plus the symlink correctness drive-by that fell out of using |
|
That information is very handy, and keeping it in this PR will allow others to find it too. I agree with your conclusions, and we can revisit if enough problematic scenarios emerge. |
Fixes #1970.
When the FTP server does not support
MLSD,FTPFileSystem.ls()falls back to parsing the output ofDIR, via_mlsd2. That parser usedline.split()and tooksplit_line[-1]as the filename, so any name containing whitespace was truncated to its last whitespace-separated token. The example in the issue:would split into 10 fields and be returned as
file.pdfinstead ofmy file.pdf.Fix
_mlsd2now doesline.split(maxsplit=8), which produces the first eight fixed fields (mode, link count, owner, group, size, month, day, time-or-year) plus the remainder of the line as a single string. The remainder is the filename, with any internal whitespace preserved as-is. This matches the approach suggested by the maintainer in the issue thread.Symlink entries in
ls -loutput have the form<name> -> <target>, so for lines whose mode starts withlthe trailing" -> <target>"is stripped to keep only the link's own name. (Previously the parser returned<target>as the entry name, sincesplit_line[-1]picked the last token.)Tests
Added four unit tests in
fsspec/implementations/tests/test_ftp.pydriving_mlsd2with a small fake FTP object that replays a canneddirlisting. This exercises the parser directly without needing an FTP server that disablesMLSD:test_mlsd2_parses_filename_without_spaces— sanity check that a plain filename still parses correctly.test_mlsd2_parses_filename_with_spaces— the regression case, covering both files and directories, and also a filename with multiple consecutive internal spaces.test_mlsd2_parses_symlink_name_without_target— verifies the link name (not the target) is returned for symlink entries.test_mlsd2_skips_short_lines— confirms the existing skip for lines with fewer than 9 fields still discards thetotal Nsummary header some servers emit.All four new tests pass. The full
fsspec/tests/andfsspec/implementations/tests/suites continue to pass on this branch; the pre-existingtest_ftp.pyfailures involving the pyftpdlib TLS test server fail identically onmasteron this machine and are unrelated to this change.