Failed to start process from bash script
I have a central server from which I periodically run a script (from cron) that checks the remote servers. The checks are performed one after another: first one server, then the next, and so on.
This script (on the central server) starts another script (let's call it update.sh) on the remote machine, and that script (on the remote machine) does something like this:
processID=`pgrep "processName"`
kill $processID
startProcess.sh
The process is killed and then started again by the script startProcess.sh:
pidof "processName"
if [ ! $? -eq 0 ]; then
    nohup "processName" "processArgs" >> "processLog" &
    pidof "processName"
    if [ ! $? -eq 0 ]; then
        echo "Error: failed to start process"
...
The update.sh and startProcess.sh scripts, and the binaries of the processes they start, are on an NFS share mounted from the central server.
Sometimes the process I try to start in startProcess.sh doesn't start and I get the error. The strange thing is that the failures are random: on one machine the process starts, on another it does not. I'm checking about 300 servers and the failures follow no pattern.
One more thing: the remote servers are located in 3 different geographical locations (2 in the US and 1 in Europe), with the central server in Europe. So far, I've found that servers in the US fail more often than those in Europe.
At first I thought the problem was related to kill, so I added a sleep between kill and startProcess.sh, but that made no difference.
Also, it seems that the process in startProcess.sh either did not start at all or something went wrong at startup, because there is no output in the log file, and there should be.
So I'm here asking for help.
Has anyone encountered this kind of problem, or does anyone know what is going wrong?
Thanks for your help
Solution
(Sorry, my original answer was rather wrong. Here is the correction.)
Using $? to get the exit status of a background process in startProcess.sh gives the wrong result. man bash states:
Special Parameters
? Expands to the status of the most recently executed foreground
pipeline.
As you mentioned in your comment, the correct way to get the exit status of a background process is the wait built-in. But for this, bash has to handle the SIGCHLD signal.
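A minimal sketch of the difference, assuming a stand-in child (sh -c 'sleep 1; exit 3', which is not part of the original scripts): $? read right after the & only reports that the job was launched, while wait on the PID saved from $! returns the child's own exit status.

```shell
#!/bin/bash
# Stand-in for a real process: runs briefly, then fails with status 3.
sh -c 'sleep 1; exit 3' &
launch_status=$?   # status of launching the job, not of the child itself
child_pid=$!       # PID of the most recently started background job

wait "$child_pid"  # blocks until that child has exited
child_status=$?    # wait hands back the child's exit status

echo "launch status: $launch_status"
echo "child status:  $child_status"
```

Here launch_status is 0 even though the child later fails, which is why checking $? immediately after starting a background process says nothing about whether it kept running.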
I made a small test environment for this to show how it works:
This is the script loop.sh, which is run as a background process:
#!/bin/bash
[ "$1" == -x ] && exit 1;
cnt=${1:-500}
while ((++c<=cnt)); do echo "SLEEPING [$$]: $c/$cnt"; sleep 5; done
If the argument is -x, it exits with status 1 to simulate an error. If the argument is a number num, it runs for num*5 seconds, printing SLEEPING [<PID>]: <counter>/<max_counter> to standard output every 5 seconds.
The second is the launcher script. It starts three loop.sh scripts in the background and prints their exit statuses:
#!/bin/bash
# Runs on every SIGCHLD: report each child that has exited and forget its PID.
handle_chld() {
    local i
    for i in "${!pids[@]}"; do
        # A child that has stopped no longer has a /proc entry.
        if [ ! -d "/proc/${pids[i]}" ]; then
            wait "${pids[i]}"
            echo "Stopped ${pids[i]}; exit code: $?"
            unset 'pids[i]'
        fi
    done
}
# Enable job control so this non-interactive script receives SIGCHLD.
set -o monitor
trap handle_chld CHLD
# Start background processes
./loop.sh 3 &
pids+=($!)
./loop.sh 2 &
pids+=($!)
./loop.sh -x &
pids+=($!)
# Wait until all background processes are stopped
while [ ${#pids[@]} -gt 0 ]; do echo "WAITING FOR: ${pids[@]}"; sleep 2; done
echo STOPPED
The handle_chld function handles the SIGCHLD signal. Setting the monitor option lets a non-interactive script receive SIGCHLD. The trap is then set for the SIGCHLD signal.
Then the background processes are started. All of their PIDs are stored in the pids array. When SIGCHLD is received, the handler checks the /proc/ directory to find which child process has stopped (the one whose entry is missing); this could also be checked with the bash built-in kill -0 <PID>. After wait, the exit status of the background process is available in the famous $? pseudo-variable.
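The kill -0 <PID> alternative mentioned above can be sketched as follows. kill -0 sends no signal; it only reports whether the PID exists and may be signalled, so it does not depend on a Linux-style /proc filesystem. The helper name is_alive and the sleep child are illustrative and not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper: succeeds while the given PID still exists.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

sleep 30 &                 # stand-in for a long-running child
pid=$!

if is_alive "$pid"; then before=alive; else before=gone; fi

kill "$pid"                # terminate the child
wait "$pid" 2>/dev/null    # reap it, so the PID is really gone
if is_alive "$pid"; then after=alive; else after=gone; fi

echo "before: $before, after: $after"
```

In handle_chld, the test [ ! -d /proc/${pids[i]} ] could be swapped for ! kill -0 "${pids[i]}" 2>/dev/null under the same assumption the original makes, namely that bash has already reaped the child by the time the trap runs.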
The main script waits for all pids to stop (otherwise it cannot get the exit status of its child processes) and stops itself.
Sample output:
WAITING FOR: 13102 13103 13104
SLEEPING [13103]: 1/2
SLEEPING [13102]: 1/3
Stopped 13104; exit code: 1
WAITING FOR: 13102 13103
WAITING FOR: 13102 13103
SLEEPING [13103]: 2/2
SLEEPING [13102]: 2/3
WAITING FOR: 13102 13103
WAITING FOR: 13102 13103
SLEEPING [13102]: 3/3
Stopped 13103; exit code: 0
WAITING FOR: 13102
WAITING FOR: 13102
WAITING FOR: 13102
Stopped 13102; exit code: 0
STOPPED
It can be seen that the exit codes are reported correctly: 1 for the loop.sh -x run and 0 for the other two.
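Applied to the startProcess.sh from the question, the same idea could look like the sketch below. It is only an illustration under stated assumptions: the failing sh -c command stands in for "processName", the temporary file stands in for "processLog", and the one-second grace period is an arbitrary value. Redirecting stderr to the log as well is an addition, so that a startup error leaves a trace.

```shell
#!/bin/bash
# Hypothetical stand-ins for the real "processLog" and "processName".
log_file=$(mktemp)

# Start the process in the background and remember its PID.
nohup sh -c 'echo "cannot load config" >&2; exit 2' >> "$log_file" 2>&1 &
pid=$!

sleep 1   # grace period (an assumption); tune to the real startup time

if kill -0 "$pid" 2>/dev/null; then
    result="running"
else
    wait "$pid"   # collect the exit status of the child that already died
    result="failed with status $?"
fi
echo "start result: $result"
```

Compared with calling pidof "processName" and reading $?, this checks the specific PID that was just started and, on failure, reports the status the process actually exited with.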
Hope this helps!