Recently, I was working on a problem with WildFly and Java hot code replacement. When the JVM refuses to update a class (for example, because you added a method), JBoss Server tooling tries to be helpful and offers to restart the web apps in the WildFly server. Unfortunately, this doesn’t work 100%: even though the class in question is deployed and used, further changes in the class will be refused. While debugging a potential fix for this (it’s a garbage collection problem, but that’s not the point of this story), I installed a new version of OpenJDK and bam! everything just stopped working. Ok, "everything" is overstating the problem a bit, but I couldn’t deploy to the server anymore, which for a web app is pretty bad. So forget about hot code replacement…this is MUCH worse. Seems I was hitting a known problem.
So what’s the problem?
The tooling complained that it couldn’t move some file to its final location, in the deployment directory of the server. Often the root directory, but not always. Hmh…the directory being locked maybe? This was on Windows, so I fired up
cmd.exe and looked at the directory in question (
dir /Q is helpful). This is where it gets weird. The directory was there, but it had no owner. Deleting it was impossible. Wat?!
procmon.exe to see what was happening. The move to the directory failed with
DELETE_PENDING. Looking at the source, this directory had been deleted (successfully) as part of the deployment, and recreated. All of a sudden, my spidey sense tingled…there was something weird about deleting stuff on Windows (I used to be a real hacker, you know…). On Unix-like OS’s, deleting a file just removes it from the directory. Once the last link to the file is gone it is deleted as soon as the last handle to the file is closed. Windows almost does the same thing, but the file is not immediately removed from the directory. Instead, it is in a zombie state, where you can still see the file, but you can’t do anything with it. That would explain the problem, so like a good little scientist, I then had a hypothesis: the WildFly server was somehow keeping a handle to the deleted directory open and thus, kept it alive. Since WildFly is using the Java WatchService to find out about changes in the deployment directory, the WatchService was my designated bad guy for the time being.
It’s quite hard to test this against the WildFly server (it being a, let’s say, "non-trivial" program), so what was needed was a simple program that reproduces the problem…clickety click…here we go: two processes, one is watching a directory, the other tries to delete, recreate and modify the directory. Looks good, except it did NOT reproduce the problem.. This could be timing-dependent, so I applied a time tested strategy called "poking at the problem". Just added some sleep(500)'s in random places and hoped for the best. Indeed, after a couple of tries, I was able to reproduce the problem maybe 20% of the time. Even better: I could reproduce it with the debugger attached. Good thing this was on OpenJDK: it meant I could just download the sources and debug through the WatchService class. And indeed, the WatchService was stuck trying to clean up the handle to our directory. It had dutifully cancelled any pending "overlapped" I/O requests (that’s just asynchronous I/O for the non-Windows folk) and now was waiting for the I/O to be indeed cancelled. But the last operation on the directory had failed (it was deleted, remember?), so there was no pending I/O, which means the thread would wait forever.
We now have a bug report and a patched build of OpenJDK that allows me to get back to solving the original problem. That’s pretty cool, eh? This is a good example of what you can do when a system is open to inspection. It helps that I can look at the OpenJDK sources; debugging bytecodes is definitely no fun. But the coding style used in the WatchService class is also very important. They mirror the Windows native calls in a Java class. This (and MSDN) makes it possible to debug the windows specific implementation of the watch service.
Thanks to Rob Stryker and Alex Kashchenko who helped me debug this problem.