DB failover causes hard lockup but I want to fail reads/writes #6170
-
I am using PostgreSQL with Patroni and HAProxy so that traffic is always directed to the primary node. During a cluster failover I get errors like the one below, and unfortunately this causes a hard lock-up of the file system for any active reads or writes. I don't know how to have errors returned to the client in a case like this so that the operation can be aborted. I know that mount.cifs has hard and soft mount options; the soft option will "not hang when the server crashes and will return errors to the user application". Note: localhost below is the HAProxy front-end address.
The process attempting a file read is stuck in disk sleep.
-
The current implementation retries transactions 50 times, with this value hardcoded in sql.go. You can modify it if necessary. PRs are welcome to make this an optional configuration.
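For illustration only, here is a minimal sketch of the kind of bounded retry loop described above. This is not the actual sql.go implementation; the function names, backoff, driver choice, and DSN are assumptions.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // driver choice is an assumption for the example
)

// withTxnRetry retries a transactional function up to maxRetries times,
// similar in spirit to a hardcoded retry limit. Names here are illustrative,
// not JuiceFS internals.
func withTxnRetry(db *sql.DB, maxRetries int, fn func(*sql.Tx) error) error {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		tx, err := db.Begin()
		if err != nil {
			lastErr = err
			time.Sleep(time.Duration(i) * 100 * time.Millisecond) // simple linear backoff
			continue
		}
		if err := fn(tx); err != nil {
			_ = tx.Rollback()
			lastErr = err
			continue
		}
		return tx.Commit()
	}
	return fmt.Errorf("transaction failed after %d attempts: %w", maxRetries, lastErr)
}

func main() {
	// Placeholder DSN; in this thread the metadata DB sits behind an HAProxy front end.
	db, err := sql.Open("postgres", "postgres://juicefs@localhost:5432/juicefs?sslmode=disable")
	if err != nil {
		panic(err)
	}
	err = withTxnRetry(db, 50, func(tx *sql.Tx) error {
		_, err := tx.Exec("SELECT 1")
		return err
	})
	fmt.Println("result:", err)
}
```

The point is only that once the retry budget is exhausted an error comes back to the caller instead of blocking, which is the behaviour being asked for in this thread.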
-
@mcassaniti
-
After a bit more digging I found something rather interesting. If I run

When this situation occurred today I had two nodes with some locked-up processes. One refreshed its session and the other did not. The one that successfully refreshed its session could have its processes killed with

I'm going to try changing the heartbeat option of the mount from the default of 12 seconds to something higher. This won't yet fix the locked-up processes, but it will mean I don't have to reboot to get things working again.
-
Another testing update. I put JuiceFS into debug mode and tried the same file copies I do during a cutover. Below you can see at most 5 retries, so I don't believe that changing the SQL retries @jiefenghuang would help based on what I'm seeing below.

Strangely, one of my nodes has the reading process stuck in disk sleep (state D) while my other nodes have the same process sleeping (state S). If I send a

Reading the code here shows that only the attempt to begin the DB transaction is failing. Once the transaction has been successfully started, it should succeed and return based on the code flow.
-
I tried something different and blocked the SQL traffic (dropping packets) from the JuiceFS client just to see what would show up in the logs. I got a lot of slow operations when I stopped dropping packets, but not many failed transactions. I'm guessing that the issue is a hung query that doesn't have a timeout. Does a simple SQL get inside a read-only transaction have a timeout?
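For what it's worth, here is a generic Go sketch of how a client-side deadline can bound a read inside a read-only transaction, so a hung connection returns an error instead of blocking forever. This is not JuiceFS code; the DSN, table, and column names are assumptions.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical DSN; in this thread the metadata DB sits behind HAProxy on localhost.
	db, err := sql.Open("postgres", "postgres://juicefs@localhost:5432/juicefs?sslmode=disable")
	if err != nil {
		panic(err)
	}

	// Bound the whole read so a hung query returns an error instead of blocking.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	tx, err := db.BeginTx(ctx, &sql.TxOptions{ReadOnly: true})
	if err != nil {
		fmt.Println("begin failed:", err)
		return
	}
	defer tx.Rollback()

	var value string
	// Table and column names are illustrative; the point is the context deadline on the query.
	err = tx.QueryRowContext(ctx, "SELECT value FROM jfs_setting WHERE name = $1", "format").Scan(&value)
	if err != nil {
		fmt.Println("query failed or timed out:", err)
		return
	}
	fmt.Println("value:", value)
}
```

A server-side statement_timeout on the PostgreSQL role would give a similar bound without any client changes.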
-
So, even without the change in SQL retries, adding connect_timeout=5 has meant that my JuiceFS mounts are no longer getting stuck and causing processes to block. At this point I'll leave the SQL retries alone.
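As a rough illustration of why this helps: connect_timeout is a standard libpq-style connection parameter, so a new connection attempt to a dead or failing-over primary errors out after the timeout instead of hanging. The snippet below is only a sketch with placeholder host, user, and database names; presumably the parameter was added to the metadata connection string in the same way.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder DSN; connect_timeout=5 makes connection attempts fail after ~5s
	// rather than blocking indefinitely while the primary is changing.
	dsn := "postgres://juicefs@localhost:5432/juicefs?sslmode=disable&connect_timeout=5"

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		panic(err)
	}

	start := time.Now()
	if err := db.Ping(); err != nil {
		fmt.Printf("connection failed after %s: %v\n", time.Since(start).Round(time.Millisecond), err)
		return
	}
	fmt.Println("connected in", time.Since(start).Round(time.Millisecond))
}
```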
Thanks so much for taking the time to work with me on this one.