Wednesday, July 25, 2012

Multiprocessing in a Bash script

    Multiprocessing in a Bash script is a simple concept and is very easy to implement. By multiprocessing I mean running multiple processes in parallel and collating their results, instead of waiting for one to finish before starting the next. Obviously, you won't be able to manage the workers with the kind of fine-grained control a full-fledged programming language provides, but it's still a useful and helpful technique to know.

The Use-case:
 A script my team was using took nearly 5 minutes to complete. By implementing a solution similar to the one shown below, I was able to reduce its run-time from 5 minutes to about 10 seconds. Helpful, isn't it? If your use-case warrants it, I think you will find this helpful too.

Consider a script that pulls host information from a service. When you query the host information service for a hostname (the exact URL format isn't shown here), the service returns a lot of info about the host, including fields such as 'Hardware type:' and 'Status:'.

Here is how we will get the 'Status' info for a bunch of hosts:

for HOSTNAME in `cat hostlist`
do
    echo -n "Hostname: $HOSTNAME "
    curl "$URL" | grep "Status"    # $URL is assumed to be built from $HOSTNAME
done

The Problem:

When the host list (the 'hostlist' file) is very long (say 1000 hosts), the script will take ages to complete.

The Solution:
for HOSTNAME in `cat hostlist`
do
    curl "$URL" | egrep 'Hostname|Status' | paste - - >> /tmp/output &
done
while [ `jobs | grep -i 'running' | wc -l` -gt 0 ]; do sleep 1; done   # wait for all children
sort /tmp/output

Note: Ensure ulimit is set so that you don't inadvertently create a fork bomb!
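As a rough sketch (the value 512 is illustrative, not a recommendation), you can check and cap the per-user process limit for the current shell before running the loop:

```shell
# Print the current per-user process limit for this shell session.
ulimit -u
# Cap it so a runaway loop cannot fork-bomb the machine; the request
# may be refused if it exceeds the hard limit, hence the fallback.
ulimit -u 512 2>/dev/null || true
```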

You may try the following to prevent a fork bomb:

for HOSTNAME in `cat hostlist`
do
    curl "$URL" | egrep 'Hostname|Status' | paste - - >> /tmp/output &
    if [ `jobs | grep -i 'running' | wc -l` -gt $LIMIT ]; then wait; fi
done
wait
sort /tmp/output

A much better solution:
Thanks to this redditor

cat hostlist | xargs -n 1 -P $LIMIT -I{} \
    bash -c "curl $URL/{} | egrep -i 'Hostname|Status' | paste - -" > /tmp/output
sort /tmp/output

The explanation for the above script is given in the 'Better solutions?' part of the 'Further Suggestions' section below.

The Explanation:
What we are trying to do is use multiprocessing to reduce runtime by doing the following:
  1. Make all queries to the service in parallel.
    • This is achieved by putting an '&' at the end of the curl command to push it into the background. This spawns curl as a child process, i.e., kicks off a separate process.
  2. Consolidate the output without jumbling it up.
    This is achieved by doing the following:
    • Each background process must produce its complete output on its own. This forces us to get both the hostname and the status info from 'curl', as opposed to using 'echo' for the hostname and 'curl' for the status (as in the first script). If we used two different commands, echo and curl, they would be two separate processes, each completing at a different time; the 'curl' paired with the first 'echo' might complete only after the 4th 'echo', which would jumble the output.
    • By ensuring that each spawned process generates the complete required output, we avoid jumbling it up.
    • What should you do if you can't produce the output from a single process?
      • For example, if you had to pull info from two different commands, or maybe two different curl statements, and put the output into one line, the above method will not work. You will need to put those commands into a small script and call that script as a child process, or spawn a subshell by using '()'.
  3. Avoid race conditions - all spawned processes need to write to the same place to consolidate the output.
    • This is achieved by appending ('>>') the output of the child processes to a file ('/tmp/output'). Each process, irrespective of when it completes, appends to the file. Since '>>' opens the file in append mode, each child's write lands at the end of the file, so two child processes writing at the same time won't overwrite each other.
  4.  Allow all jobs to finish before generating the output:
    • This is done by the 'while' loop and the final 'sort'.
    • All the while loop does is check whether the 'jobs' command shows any jobs as running. If so, it sleeps for 1 second and tries again; if not, it breaks out of the loop.
    • The final statement sorts the output and shows it to STDOUT.
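Points 2-4 above can be sketched end to end with a stand-in for curl, so the pattern runs anywhere; 'get_status' and the host names below are made up for illustration:

```shell
#!/bin/bash
# get_status stands in for the real curl call (hypothetical function);
# the short random sleep makes the hosts "respond" in arbitrary order.
get_status() { sleep "0.$((RANDOM % 3))"; echo "Status: up"; }

OUT=$(mktemp)
for HOSTNAME in host1 host2 host3
do
    # The subshell produces the complete record for one host; paste
    # joins its two lines into one, which is appended to the shared file.
    ( echo "Hostname: $HOSTNAME"; get_status ) | paste - - >> "$OUT" &
done
# Wait until 'jobs' no longer reports anything running, then sort.
while [ `jobs | grep -i 'running' | wc -l` -gt 0 ]; do sleep 1; done
sort "$OUT"
```

Because each host's record is written in one go, the sorted output stays intact no matter which background job finishes first.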

Further Suggestions: 
  • How about multiple instances of the same process?
    • A script may call itself by using '$0' and have a condition within itself to exit. For example:
      • #!/bin/bash
        echo "Hello from $$"
        echo "1" >> /tmp/t
        if [ `cat /tmp/t | wc -l` -le 10 ]
        then
            $0 &
        fi
      • The above script will keep calling itself until the file '/tmp/t' reaches 11 lines.
    • Better solutions?
      • There are better solutions viz.,
        • xargs -P
          • xargs takes input from STDIN, '-n' arguments at a time. In the script we specified '-n 1', which makes xargs deal with one argument at a time.
          • It applies the argument to the command specified, viz. 'bash -c "curl $URL/{}..."'.
          • The '{}' is a placeholder for the argument, specified by the '-I {}' option.
          • The '-P $LIMIT' option tells xargs to launch up to $LIMIT processes in parallel.
          • So 'cat hostlist' passes the entire host list to xargs.
          • xargs takes one hostname at a time and substitutes it into the given command (calling curl and filtering its output).
          • It does this with up to $LIMIT processes running simultaneously.
        • GNU Parallel
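
To see the xargs pattern in isolation, here is a self-contained toy version in which a plain 'echo' stands in for the whole curl pipeline (the host names are made up):

```shell
# xargs reads one hostname at a time (-n 1), substitutes it for {}
# (-I{}), and runs up to 4 commands in parallel (-P 4).
printf '%s\n' host1 host2 host3 |
    xargs -n 1 -P 4 -I{} bash -c 'echo "Hostname: {} Status: up"' |
    sort
```

Substituting {} directly into a bash -c string is fine for trusted input like this, but with untrusted input it is an injection risk; passing the value as a positional argument, as in bash -c 'echo "Hostname: $1 Status: up"' _ {}, is safer.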


    1. Why not use GNU Parallel?

      parallel -j0 echo -n Hostname: {}\; curl {} '|' grep Status :::: hostlist

      or use --tag:

      parallel -j0 --tag curl {} '|' grep Status :::: hostlist

      It is shorter. It does not fail if the hostlist contains more hosts than the number of processes. There is not even a theoretical chance of a race condition.

      Watch the intro videos to learn more:

      1. I didn't know that `parallel` existed... Thanks a bunch, mate!

      2. Hi, thanks for reading and commenting!

        The reason I didn't go with parallel is that it's not yet that popular in enterprise environments that run RHEL5 etc.

        The script I had written was mainly for enterprise environments and parallel wasn't an option!

        Thanks for the examples though.

      3. There is no doubt that xargs -P has problems with mixed output (see for example), and that GNU Parallel does not.

        GNU Parallel can work as a single script solution and can be installed by 2 simple commands:

        chmod 755 parallel

        I am trying to understand why you prefer the script above over GNU Parallel. I understand that you believe GNU Parallel is not popular in enterprise environments that have RHEL5, but I would think you will find a lot more RHEL5 installations that have GNU Parallel installed than your script - either as a package or as a single script solution.

        Can you elaborate on why the script you have written is better for enterprise environment than a tool that has had more than 10 years of intensive testing?

      4. The purpose of this blog post wasn't to show the best solution out there. It was to show how I had solved a problem.

        By posting it, I came across better solutions and in turn improved mine :-)

        The reason I couldn't use GNU Parallel was legal (its GPLv3 licensing). I wanted to solve the problem using the tools at hand, and my solution worked well.

        Hence I decided to post it, following which I found some shortcomings and improved upon it. Yes, GNU Parallel IS A BETTER SOLUTION. I will add a script that demonstrates it shortly.

    2. I have changed the blog subject to 'Multiprocessing', as 'multithreading' gives the wrong idea of actual threads that can share resources.

    3. Hello, these last few days I have been trying to use xargs and perl together to do substitutions in parallel.
      I use
      Sequential mode:
      perl -pe 's/ ?(\d{18,})(:? +)(.*)$/\1\3\(\1\)/g' $vFile > ${vFile}_sec
      Parallel mode:
      cat $vFile | xargs -d "\n" -n 1 -P $vLimit -I"LIN" bash -c 'echo "LIN" | perl -pe "s/ ?(\d{18,})(:? +)(.*)$/\1\3\(\1\)/g"' > ${vFile}_par

      In my tests the sequential mode is faster than the parallel mode.
      Could you help me fix that?
      Note: It is not possible to install GNU Parallel.

      Regards, Misael.