Current implementation of ClientFTP not working with FTP servers partial responses #687

@saraaubry

Since the upgrade of the Apache Commons Net library to version 3.11.1 in Heritrix 3.10.0, the Heritrix FTP client no longer accepts replies that contain only a code (instead of a code followed by a message on the same line). This can be seen in the logs:

2025-10-20T14:16:49.824Z -2 - ftp://ftp.letelegramme.fr/BREST-EST_20251020.pdf - - unknown #035 20251020141649615+209 - ftp://ftp.letelegramme.fr/BREST-EST_20251020.pdf - org.apache.commons.net.MalformedServerReplyException: Truncated server reply: '550 '
    at org.apache.commons.net.ftp.FTP.getReply(FTP.java:634)
    at org.apache.commons.net.ftp.FTP.getReply(FTP.java:581)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:1247)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:1177)
    at org.apache.commons.net.ftp.FTP.cwd(FTP.java:447)
    at org.apache.commons.net.ftp.FTPClient.changeWorkingDirectory(FTPClient.java:1144)
    at org.archive.modules.fetcher.FetchFTP.fetch(FetchFTP.java:390)
    at org.archive.modules.fetcher.FetchFTP.innerProcess(FetchFTP.java:283)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:176)
    at org.archive.modules.Processor.process(Processor.java:143)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
org.apache.commons.net.MalformedServerReplyException: Truncated server reply: '550 '
    at org.apache.commons.net.ftp.FTP.getReply(FTP.java:634)
    at org.apache.commons.net.ftp.FTP.getReply(FTP.java:581)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:1247)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:1177)
    at org.apache.commons.net.ftp.FTP.cwd(FTP.java:447)
    at org.apache.commons.net.ftp.FTPClient.changeWorkingDirectory(FTPClient.java:1144)
    at org.archive.modules.fetcher.FetchFTP.fetch(FetchFTP.java:390)
    at org.archive.modules.fetcher.FetchFTP.innerProcess(FetchFTP.java:283)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:176)
    at org.archive.modules.Processor.process(Processor.java:143)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

Some FTP servers still reply with only a code, as the answer to `CWD` (`< 550`) in this control conversation shows:
`WARC/1.0
WARC-Type: metadata
WARC-Target-URI: ftp://ftp.letelegramme.fr/BREST-EST_20251020.pdf
WARC-Date: 2025-10-20T14:01:54Z
WARC-IP-Address: 217.167.243.249
WARC-Record-ID: urn:uuid:cca83cd5-71d2-4ba5-86b0-701b9f02c88a
Content-Type: text/x-ftp-control-conversation
Content-Length: 585

  • Opening control connection to 217.167.243.249:21
    < 220 Microsoft FTP Service

USER bnftb
< 331 Password required
PASS eAr2Q5b6
< 230 User logged in.
CWD /BREST-EST_20251020.pdf
< 550
TYPE I
< 200 Type set to I.
PASV
< 227 Entering Passive Mode (217,167,243,249,193,45).
RETR /BREST-EST_20251020.pdf
< 125 Data connection already open; Transfer starting.

  • Opened data connection to 217.167.243.249:49453
  • Closed data connection to 217.167.243.249:49453
    < 226 Transfer complete.

QUIT
< 221 Goodbye.

  • Closed control connection to 217.167.243.249:21`
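
To make the failure concrete, here is a minimal sketch of the kind of check that rejects the `550` reply above. It is not the commons-net source: the class `ReplyParsingSketch`, the method `parseReplyCode`, and the exception `MalformedReply` are hypothetical names, the messages only imitate the one in the stack trace, and multi-line replies (`550-...`) are not handled.

```java
/** Hypothetical illustration of strict vs. lenient reply parsing; not commons-net code. */
final class ReplyParsingSketch {

    private static final int REPLY_CODE_LEN = 3;

    /** Hypothetical unchecked stand-in for MalformedServerReplyException. */
    static final class MalformedReply extends RuntimeException {
        MalformedReply(String message) {
            super(message);
        }
    }

    /**
     * Returns the numeric code of a single-line reply.
     * When strict is true, the code must be followed by a space and some text.
     */
    static int parseReplyCode(String line, boolean strict) {
        if (line.length() < REPLY_CODE_LEN) {
            throw new MalformedReply("Truncated server reply: '" + line + "'");
        }
        final int code;
        try {
            code = Integer.parseInt(line.substring(0, REPLY_CODE_LEN));
        } catch (NumberFormatException e) {
            throw new MalformedReply("Could not parse response code: '" + line + "'");
        }
        if (strict) {
            // "550" and "550 " both lack the message text that strict parsing expects.
            if (line.length() <= REPLY_CODE_LEN + 1) {
                throw new MalformedReply("Truncated server reply: '" + line + "'");
            }
            if (line.charAt(REPLY_CODE_LEN) != ' ') {
                throw new MalformedReply("Invalid server reply: '" + line + "'");
            }
        }
        return code;
    }

    public static void main(String[] args) {
        // Lenient parsing accepts the bare code.
        System.out.println(parseReplyCode("550 ", false)); // prints 550
        // Strict parsing rejects the same line.
        try {
            parseReplyCode("550 ", true);
        } catch (MalformedReply e) {
            System.out.println(e.getMessage()); // prints Truncated server reply: '550 '
        }
    }
}
```

With lenient parsing the crawler would see an ordinary `550` reply to `CWD` and could carry on with `RETR`, as the recorded conversation does.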

There is a parameter that disables this compliance check: `strictReplyParsing` in the commons-net FTP class. The field is private, so a subclass cannot assign it directly; one way to change it is by reflection. Since Heritrix already extends the FTPClient class (which extends the FTP class), we just need to disable the parameter in the constructor of that subclass.
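
A minimal sketch of the reflection approach, under stated assumptions: `BaseFtp` is a hypothetical stand-in for the commons-net FTP class (only the field name `strictReplyParsing` comes from this issue), and `LenientFtpClient` stands in for the Heritrix subclass. It may also be worth checking whether the commons-net version in use exposes a public setter for this flag (possibly `setStrictReplyParsing(boolean)`, to be verified against the Javadoc); if so, calling it from the constructor would make reflection unnecessary.

```java
import java.lang.reflect.Field;

/** Entry point for the sketch; all class names in this file are hypothetical. */
final class StrictFlagReflectionSketch {
    public static void main(String[] args) {
        System.out.println(new BaseFtp().acceptsCodeOnlyReply());          // prints false
        System.out.println(new LenientFtpClient().acceptsCodeOnlyReply()); // prints true
    }
}

/** Hypothetical stand-in for the library base class: a private flag that defaults to true. */
class BaseFtp {
    private boolean strictReplyParsing = true;

    boolean acceptsCodeOnlyReply() {
        return !strictReplyParsing;
    }
}

/** Hypothetical stand-in for the subclass that turns the flag off when constructed. */
class LenientFtpClient extends BaseFtp {

    LenientFtpClient() {
        disableStrictReplyParsing();
    }

    private void disableStrictReplyParsing() {
        try {
            // getDeclaredField only looks at the class it is called on,
            // so it must be the class that declares the private field.
            Field field = BaseFtp.class.getDeclaredField("strictReplyParsing");
            field.setAccessible(true);
            field.setBoolean(this, false);
        } catch (ReflectiveOperationException | RuntimeException e) {
            // If the field is renamed or access is denied, keep the strict default.
            System.err.println("Could not disable strict reply parsing: " + e);
        }
    }
}
```

The lookup depends on the field name, so a library upgrade that renames the field would silently restore strict parsing; logging the failure, as above, at least makes that visible.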
