The Case of the Lost Data Packets

THE MYSTERY

An aircraft manufacturer has been using a system-wide data acquisition (DAQ) system that ACES designed about three years ago. Part of the protocol involves measuring metal heat-treat batches that sometimes run around the clock.

The customer called ACES because they were losing data for about 20 minutes every Saturday morning. What exactly had ACES programmed the system to do on Saturday mornings?

THE CLUES

The ACES CSI (Control System Investigator) who’d programmed the DAQ knew that there wasn’t anything programmed for Saturday mornings: All his programs were based on batch level, not time of day.

The drives in question were on a domain administered by the customer, so the CSI asked the IT department to set up a network sniffer to analyze the data packets. The only thing they found was that from 8:00 to 8:20 a.m. on Saturday the local computer was trying to authenticate to a remote server two or three times. Once it succeeded, it stayed authenticated until 8:00 the next Saturday.

The CSI went back to his own laptop and simulated mapping to local folders. He started his application running and experimented with disconnecting the drives. He was able to recreate the customer’s issue, but still wasn’t sure what was causing the drive to go offline.

The IT department checked with the network administrators in charge of the drives and they said, “Oh, at that time of day on Saturday we’re doing a defrag.”

THE PERP

After some research, the CSI discovered the whole error came down to a “feature” in Windows. When you ask Windows if a network drive exists, it doesn’t just respond with a simple yes or no. It goes out and checks the status of that network drive. If the drive is down, or available but offline for some reason (for example, during a defrag), any application that asked about the mapped drive can stop responding for a few minutes, just as File Explorer does. The data acquisition application was hanging from 8:00 to 8:20. It managed only one or two reads in that window instead of one every minute, so the customer was losing data.

The CSI had to find a way to make the app continue to run while it was waiting for the return answer from Windows.
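To make the problem concrete, here is a minimal Python sketch of one way to keep a caller from stalling on a slow drive check. It is an illustration only, not the CSI's code: the case does not say what language the DAQ application was written in, and the function name path_exists_with_timeout and the two-second timeout are hypothetical. The sketch assumes the blocking call is something like os.path.exists on a mapped drive and moves it onto a background thread so the caller can give up waiting.

```python
import os
import threading


def path_exists_with_timeout(path, timeout_s=2.0):
    """Return True or False if the check finishes in time, else None.

    Hypothetical helper: the existence check runs on a daemon thread so
    the caller waits at most timeout_s seconds for an answer.
    """
    result = []

    def worker():
        # This is the call that may block while a network share is offline.
        result.append(os.path.exists(path))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)  # give up waiting after timeout_s seconds
    # An empty list means the check has not returned yet: status unknown.
    return result[0] if result else None
```

A return value of None means "no answer yet", which the caller would treat as "drive unavailable for now". The stalled thread is still waiting in the background, so this only hides the delay from the caller. The CSI took a different route, described next.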

THE SOLUTION

He wrote a second application that he didn’t mind locking up: a file transfer utility that can hang and then move the missed files later. If a defrag is in progress, the utility hangs, but it isn’t collecting any data, so it doesn’t matter.

The main application that collects data no longer hangs because it never checks the status of the network drive. It stores the batch data in a local directory and then the file transfer utility — when it sees the network drive is back online — retrieves the files and stores them in the network directory.
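The split described above can be sketched in a few lines of Python. This is a hedged illustration of the store-locally-then-forward pattern, not the ACES implementation: the function names write_batch_record and forward_pending_files, the ".tmp" suffix, and the directory arguments are all assumptions made for the example.

```python
import shutil
from pathlib import Path


def write_batch_record(local_dir, name, text):
    """Acquisition side: write to the local disk only, never the network."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    target = local_dir / name
    # Write under a temporary name first so the transfer side never
    # picks up a half-written file, then rename it into place.
    tmp = local_dir / (name + ".tmp")
    tmp.write_text(text)
    tmp.replace(target)
    return target


def forward_pending_files(local_dir, network_dir):
    """Transfer side: move finished files once the network directory is up.

    The existence check below may block while the share is offline.
    That is acceptable here because this code collects no data.
    """
    network_dir = Path(network_dir)
    if not network_dir.exists():
        return []  # share not reachable; try again on the next pass
    moved = []
    for src in sorted(Path(local_dir).iterdir()):
        if src.is_file() and src.suffix != ".tmp":
            shutil.move(str(src), str(network_dir / src.name))
            moved.append(src.name)
    return moved
```

In this sketch the two functions would run in separate programs: the acquisition program calls only write_batch_record, while the transfer utility calls forward_pending_files in a loop. If the transfer utility stalls during the Saturday defrag, files simply accumulate in the local directory until it resumes.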

Now the customer’s batch data flows seamlessly and automatically from the local directory to the network directory. Even early on Saturday mornings.

Long after the integration of a complex DAQ, changes in procedure can require adjustments to the system. ACES CSIs are standing by to invent creative solutions to any challenges that arise.

CASE CLOSED