Description
Since the Apache Commons Net library was upgraded to version 3.11.1 in Heritrix 3.10.0, the Heritrix FTP client no longer accepts server replies that contain only a reply code (instead of a code followed by a message on the same line). The logs show the resulting exception:
2025-10-20T14:16:49.824Z -2 - ftp://ftp.letelegramme.fr/BREST-EST_20251020.pdf - - unknown #035 20251020141649615+209 - ftp://ftp.letelegramme.fr/BREST-EST_20251020.pdf -
org.apache.commons.net.MalformedServerReplyException: Truncated server reply: '550 '
    at org.apache.commons.net.ftp.FTP.getReply(FTP.java:634)
    at org.apache.commons.net.ftp.FTP.getReply(FTP.java:581)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:1247)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:1177)
    at org.apache.commons.net.ftp.FTP.cwd(FTP.java:447)
    at org.apache.commons.net.ftp.FTPClient.changeWorkingDirectory(FTPClient.java:1144)
    at org.archive.modules.fetcher.FetchFTP.fetch(FetchFTP.java:390)
    at org.archive.modules.fetcher.FetchFTP.innerProcess(FetchFTP.java:283)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:176)
    at org.archive.modules.Processor.process(Processor.java:143)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
Some FTP servers still reply with only a code and no message text, as the `CWD` reply in this WARC metadata record shows:
`WARC/1.0
WARC-Type: metadata
WARC-Target-URI: ftp://ftp.letelegramme.fr/BREST-EST_20251020.pdf
WARC-Date: 2025-10-20T14:01:54Z
WARC-IP-Address: 217.167.243.249
WARC-Record-ID: urn:uuid:cca83cd5-71d2-4ba5-86b0-701b9f02c88a
Content-Type: text/x-ftp-control-conversation
Content-Length: 585
- Opening control connection to 217.167.243.249:21
< 220 Microsoft FTP Service
USER bnftb
< 331 Password required
PASS eAr2Q5b6
< 230 User logged in.
CWD /BREST-EST_20251020.pdf
< 550
TYPE I
< 200 Type set to I.
PASV
< 227 Entering Passive Mode (217,167,243,249,193,45).
RETR /BREST-EST_20251020.pdf
< 125 Data connection already open; Transfer starting.
- Opened data connection to 217.167.243.249:49453
- Closed data connection to 217.167.243.249:49453
< 226 Transfer complete.
QUIT
< 221 Goodbye.
- Closed control connection to 217.167.243.249:21`
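To make the difference concrete, here is a minimal sketch of a strict versus a lenient check on a single-line reply. It is a simplified illustration, not the Commons Net source: the class `ReplyLineCheck` and the method `isAccepted` are hypothetical names, and multi-line replies (`NNN-...`) are deliberately ignored.

```java
/** Simplified stand-in for an FTP reply-line check; NOT the Commons Net implementation. */
final class ReplyLineCheck {
    private static final int REPLY_CODE_LEN = 3;

    /**
     * Returns true if a single-line reply would be accepted.
     * Strict mode requires "NNN<space><text>"; lenient mode accepts a bare 3-digit code.
     * Multi-line replies ("NNN-...") are out of scope for this sketch.
     */
    static boolean isAccepted(String line, boolean strict) {
        if (line.length() < REPLY_CODE_LEN) {
            return false; // too short to hold a reply code at all
        }
        for (int i = 0; i < REPLY_CODE_LEN; i++) {
            if (!Character.isDigit(line.charAt(i))) {
                return false; // the first three characters must be digits
            }
        }
        if (!strict) {
            return true; // lenient: the code alone is enough
        }
        // Strict: a space separator followed by at least one character of text.
        return line.length() > REPLY_CODE_LEN + 1 && line.charAt(REPLY_CODE_LEN) == ' ';
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("550 ", true));              // false
        System.out.println(isAccepted("550 ", false));             // true
        System.out.println(isAccepted("550 No such file", true));  // true
    }
}
```

Under these assumptions, the `550 ` reply from the log fails only in strict mode, which is consistent with the "Truncated server reply" exception above.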
Commons Net has a parameter that disables this reply-format check: `strictReplyParsing` in the `FTP` class. The field is private, so it cannot be assigned from outside the `FTP` class; we have to set it by reflection. Since Heritrix already extends the `FTPClient` class (which itself extends `FTP`), we only need to disable the parameter in the constructor of that subclass.
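The constructor-based approach described above can be sketched as follows. To stay self-contained, the example does not depend on Commons Net: `StrictBaseClient` is a hypothetical stand-in for the library base class with a private `strictReplyParsing` field, and `LenientClient` is a hypothetical stand-in for the Heritrix subclass. Only the reflection pattern itself is the point.

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in for a library base class holding a private flag. */
class StrictBaseClient {
    private boolean strictReplyParsing = true;

    /** Reports whether a bare reply code would be accepted. */
    boolean acceptsBareCode() {
        return !strictReplyParsing;
    }
}

/** Hypothetical subclass that switches the private flag off in its constructor. */
class LenientClient extends StrictBaseClient {
    LenientClient() {
        try {
            // Look the field up on the class that declares it, not on the subclass.
            Field flag = StrictBaseClient.class.getDeclaredField("strictReplyParsing");
            flag.setAccessible(true);    // bypass the private modifier
            flag.setBoolean(this, false); // disable strict parsing for this instance
        } catch (ReflectiveOperationException e) {
            // Field renamed or inaccessible: fail loudly rather than stay strict silently.
            throw new IllegalStateException("could not disable strictReplyParsing", e);
        }
    }
}

class ReflectionDemo {
    public static void main(String[] args) {
        System.out.println(new StrictBaseClient().acceptsBareCode()); // false
        System.out.println(new LenientClient().acceptsBareCode());    // true
    }
}
```

Reflection ties the subclass to the private field's name, so a future library release that renames it would break the constructor. It may also be worth checking whether the Commons Net version in use exposes a public `setStrictReplyParsing(boolean)` setter on `FTP`; if it does, calling that from the constructor would avoid reflection entirely.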