Featured Geek

Reliam's most important asset is our talented team of Internet Application Engineers. Every month exceptional individuals from our team are featured. They discuss issues or solutions that are dear to the featured geek's heart at the present moment. These issues just might be dear to your heart too.

Taking Your Script Multithreaded

By Brandon Burton, Internet Application Engineer

Challenge

November 22, 2010 - As part of a recent migration and upgrade, we needed to move approx. 160GB of data between two systems. At first glance, this seems like a relatively easy proposition. Just use robocopy or rsync or the like, and that is where we started.

At first we tried robocopy, since both systems were running Windows Server. After robocopy had run for 12 hours, it was apparent that even though robocopy claims to be able to do differential transfers, it either doesn’t or is very inefficient about it.

After that, we looked at the various rsync-on-Windows options and decided to try DeltaCopy. We installed, set up, and ran DeltaCopy, which got our sync time down to 4-6 hours, depending on when we ran it, as load is highest during business hours.

This still seemed too high, so we investigated why. As we evaluated the situation, it became apparent that it was more complicated than a simple copy, for the following reasons:

  • The data set contained files ranging in size from a few hundred kilobytes to 2-3GB
  • Most of these files were constantly changing, and the changes were appends, so a 2GB file would become 2.1GB, and so forth
  • Due to these constant changes and user interaction with the services running on the system, there was a fairly constant amount of random disk I/O
  • There was also a constant CPU utilization of around 30-40% with spikes up to 80-90% during peak load

Solution

So given all that, our thought was: how can we parallelize the workload? We considered the situation and decided to write a multithreaded script for the syncing. As I was tasked with this and I am a Python guy at heart, I chose to implement it in Python. This proved to be relatively easy to do, as Python includes the threading and Queue modules in its standard library.

In summary, this script reads in a list of the directories in the source directory, then spawns off N number of parallel threads (N is controlled by a variable). It will skip certain directories, as defined in an exclude list. It continues to run threads until the queue is empty and all the sync threads have finished. Along the way, it logs some output to log files and outputs terser logging to the console.
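The flow described above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the script itself: the module is named `queue` in Python 3 (`Queue` in Python 2), and `run_pool`, `sync_one`, and the sample directory names are hypothetical stand-ins for the real rsync work.

```python
import queue
import threading

def run_pool(directories, exclude, num_threads, sync_one):
    """Call sync_one(name) for each directory not in exclude, using num_threads workers."""
    work = queue.Queue()
    done = []                      # names that were actually synced
    done_lock = threading.Lock()   # protects 'done' across threads

    def worker():
        while True:
            name = work.get()      # blocks until an item is available
            try:
                if name not in exclude:
                    sync_one(name)
                    with done_lock:
                        done.append(name)
            finally:
                work.task_done()   # always mark the item as handled

    # Daemon threads exit when the main thread does.
    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()

    for name in directories:
        work.put(name)

    work.join()                    # wait until every queued item is marked done
    return sorted(done)

# Hypothetical example: the "sync" here does nothing.
synced = run_pool(["a", "b", "tmp", "c"], exclude={"tmp"}, num_threads=3,
                  sync_one=lambda name: None)
print(synced)
```

Calling `task_done()` in a `finally` block means `join()` cannot hang if one sync raises an exception.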

What follows is a walkthrough of how I implemented this multithreaded sync script in Python.

The Code

Step one is to import the libraries we’ll need from the standard library.

These are the libraries needed for the threading and the queue we’ll use to spawn off threads to sync each directory:

from threading import Thread
from Queue import Queue

Then we need some libraries for interacting with the system, logging, and spawning off our sync process

import subprocess, os, sys
import time
import datetime

Next we define some variables and objects we’ll need for controlling our number of threads, queue, and syncing process. Note that since the rsync process we used is actually a tiny bundle of Cygwin, the source and destination paths are Unix style.

num_threads = 75
queue = Queue()
rsync = "\"C:\\Program Files\\DeltaCopy\\rsync.exe\" -v -rlt -z --chmod=a=rw,Da+x --delete"
source = "/path/to/source"
destination = "/path/to/destination"

Next we define some bits for logging. As this script ran on Windows, my log file paths are Windows style.

now = datetime.datetime.now()
logfiledate = now.strftime("%Y-%m-%d %H-%M-%S")
logfile = "D:\\logs\\threads_log" + logfiledate + "_log.txt"

As there were a number of files in the root of the source directory, I chose to implement a small function to just get a list of the directories in the source, as I only want to sync the directories

def listdirs(folder):
    return [d for d in os.listdir(folder) if os.path.isdir(os.path.join(folder, d))]
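For readers on Python 3, a comparable helper could look like the sketch below. It uses `os.scandir`, which avoids building a full path for every entry; the name `list_dirs` and the temporary directory layout are hypothetical and only there to make the example runnable.

```python
import os
import tempfile

def list_dirs(folder):
    """Return the names of the immediate subdirectories of folder, sorted."""
    with os.scandir(folder) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())

# Hypothetical demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "alpha"))
    os.mkdir(os.path.join(root, "beta"))
    with open(os.path.join(root, "notes.txt"), "w") as handle:
        handle.write("a file in the root, which should be ignored\n")
    found = list_dirs(root)
    print(found)
```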

I call this function on the source to get my list. I also declare a list of folders I’m going to exclude, as there are certain folders in the root of the source I don’t want to sync

directories = listdirs(source)
exclude = []

Even though I’ve yet to declare my main function that will do most of the work, I print and log output that indicates things are starting

print "Beginning"
print datetime.datetime.now()
file = open(logfile, 'w')
file.write("Beginning\n")
file.write(str(datetime.datetime.now()) + "\n\n\n")
file.close()

While developing, I was trying different things, so I put this pause in. I found it useful, so I just left it in

print "pausing in case of debugging"
time.sleep(30)

This function is where we do most of the work. I’ve left in some lines that are for debugging output, but they are commented out by default. For a script this short, I find it easier to just comment them out than to put in conditional logic based on a verbose argument or the like

def syncDirectory(i, queue):
    """Sync directories taken from the queue."""
    #print "running thread",

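When the function is called, a queue is passed to it. The function then loops forever: it gets a directory name from the queue and works on it, and queue.get() blocks whenever the queue is empty. (A test such as queue.empty != True would not help here, because queue.empty is a method and that comparison is always true.) The workers are daemon threads, so they exit when the main thread does.

    while True:
        dir = queue.get()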

If the directory from the queue is on our exclude list, we’ll skip it by marking it as done on the queue

        if dir in exclude:
            print "excluding", dir
            queue.task_done()

If it’s not in our exclude list, then we start by logging that we’re beginning to sync that particular directory and proceed to sync it. One thing to note is that this script writes both a summary to one log file and a separate log file of the rsync output for each directory.

What and how much you log is a tricky balance. I try to err on the side of too much, as there are always issues that need to be debugged and edge cases you didn’t think of, like weird characters in file/directory names. Finally, I sleep for one second to try to keep the console output in order: many of the directory syncs finished in less than a second, and since the script is threaded, the order in which things are output to the console is non-deterministic.

        else:
            file = open(logfile, 'a')
            file.write(" Beginning " + dir + "\n")
            file.write(" " + str(datetime.datetime.now()) + "\n")
            file.close()

            print "syncing", dir
            #print " " + source + "\\" + dir
            #print " " + destination + "\\" + dir
            src = source + "/" + dir + "/"
            dest = destination + "/" + dir + "/"
            command = rsync + " " + src + " " + dest
            print command

            now = datetime.datetime.now()
            logfiledate = now.strftime("%Y-%m-%d %H-%M-%S")

            # rsync output for this directory goes to its own log file
            ret = subprocess.call(command,
                                  shell=True,
                                  stdout=open('D:\\' + dir + logfiledate + '-log.txt', 'a'),
                                  stderr=subprocess.STDOUT)
            #print "running dir on", dir
            queue.task_done()
            file = open(logfile, 'a')
            file.write(" Finishing " + dir + "\n")
            file.write(" " + str(datetime.datetime.now()) + "\n")
            file.close()
            print "done syncing", dir
            time.sleep(1)

Then we do some output to give an indicator of how many tasks in the queue remain

        queuesize = queue.qsize()
        file = open(logfile, 'a')
        file.write(" " + str(queuesize) + " remaining \n")
        file.close()

        print queuesize, "remaining"
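The function above builds a single command string and runs it with shell=True, which can be fragile with the weird characters in file/directory names mentioned earlier. One possible alternative, sketched here in Python 3, passes the arguments as a list so that `subprocess` handles the quoting. The names `build_rsync_command` and `run_sync` are hypothetical; the executable path and flags are copied from the `rsync` variable defined earlier, and whether the Cygwin rsync bundled with DeltaCopy treats every such name correctly is an assumption that would need testing.

```python
import subprocess

RSYNC_EXE = r"C:\Program Files\DeltaCopy\rsync.exe"
RSYNC_FLAGS = ["-v", "-rlt", "-z", "--chmod=a=rw,Da+x", "--delete"]

def build_rsync_command(source, destination, name):
    """Return the rsync argument list for one directory."""
    src = source + "/" + name + "/"
    dest = destination + "/" + name + "/"
    return [RSYNC_EXE] + RSYNC_FLAGS + [src, dest]

def run_sync(command, log_path):
    """Run the command, appending its output to log_path; return the exit code."""
    # The 'with' block closes the log file even if the call fails.
    with open(log_path, "a") as log:
        return subprocess.call(command, stdout=log, stderr=subprocess.STDOUT)

# Build (but do not run) the command for a directory name containing a space.
command = build_rsync_command("/path/to/source", "/path/to/destination", "my dir")
print(command)
```

Each list element reaches the child process as one argument, so a space in a directory name does not split the path in two.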

This preps our threads to run as soon as we populate the queue

for i in range(num_threads):
    worker = Thread(target=syncDirectory, args=(i, queue))
    worker.setDaemon(True)
    worker.start()

Next, we populate the queue

for dir in directories:
    queue.put(dir)

Then, we kick it off

print "Main Thread Waiting"
queue.join()
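Because the workers are started before the queue is populated, they depend on get() blocking until work arrives. A related pitfall is worth a quick Python 3 sketch: `empty` is a method, so comparing the attribute itself to `True` never reflects the queue's state; only the call `empty()` does. The queue `q` and the item name are hypothetical.

```python
import queue

q = queue.Queue()

# 'q.empty' without parentheses is a bound method object, so comparing it
# to True says nothing about the queue's contents.
attribute_check_when_empty = (q.empty != True)
call_check_when_empty = q.empty()

q.put("some-directory")
attribute_check_when_full = (q.empty != True)
call_check_when_full = q.empty()

print(attribute_check_when_empty, attribute_check_when_full)  # True True
print(call_check_when_empty, call_check_when_full)            # True False

# get() with no arguments blocks until an item is available, which is what
# lets worker threads start before the queue has been filled.
item = q.get()
q.task_done()
print(item)
```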

Finally, we finish and log the finish time

print "Done"
print datetime.datetime.now()
file = open(logfile, 'a')
file.write("Done\n")
file.write(str(datetime.datetime.now()) + "\n")
file.close()

Conclusion

So that is the background and a walkthrough of how we built a multithreaded sync script. Taking the script multithreaded and doing test runs to see how many parallel threads we could run, which ultimately turned out to be 75, shrank the sync window from 4 hours to 45 minutes, which in turn shrank the cutover window to an hour.

Obviously this is my first attempt at a multithreaded script, and I learned a lot along the way. Some of the things I’d like to improve and invite feedback on are:

  • A better way to gauge the number of threads still running and which threads they are, as once the queue is empty, there are still a number of threads running. In the case of this migration, I just used pgrep and wc -l.
  • Suggestions on better logging. I know there is the logging library and others; I’ve just yet to take the time to play with them
  • Should I be using multiprocessing instead of Thread and Queue?
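As one possible direction for the first two points, here is a Python 3 sketch that uses the standard `logging` module, which is thread-safe and can stamp each record with the time and thread name, together with `concurrent.futures.ThreadPoolExecutor`, whose futures make it easy to see which directories have not finished yet. `sync_one`, `sync_all`, and the directory names are hypothetical placeholders for the real rsync call. On the third point, each worker spends most of its time waiting on an external rsync process, so threads are probably sufficient; that is an assumption, not something measured here.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(threadName)s %(message)s",
)
log = logging.getLogger("sync")

def sync_one(name):
    """Placeholder for the real per-directory rsync call."""
    log.info("syncing %s", name)
    return name

def sync_all(directories, num_threads):
    """Sync every directory, logging what is still unfinished after each one completes."""
    finished = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Map each future back to the directory it is working on.
        futures = {pool.submit(sync_one, name): name for name in directories}
        pending = set(futures.values())   # queued or currently running
        for future in as_completed(futures):
            name = futures[future]
            future.result()               # re-raises any exception from the worker
            pending.discard(name)
            finished.append(name)
            log.info("done %s, %d not finished: %s",
                     name, len(pending), sorted(pending))
    return finished

finished = sync_all(["alpha", "beta", "gamma"], num_threads=2)
print(sorted(finished))
```

The `pending` set covers both queued and running directories; it is maintained in the main thread only, so it needs no lock.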

I also want to point out that all of this is running on Windows, and Python is proving to be a great scripting tool for doing work on Windows.

Finally, I’ve put up the full script as a GitHub repo. My goal is to rewrite the script as I get feedback on improvements; this will serve as an educational exercise for myself and anyone who follows along. If I get a lot of feedback, I’ll do a follow-up post in 1-2 months on what I’ve learned along the way.

Please provide your feedback via Twitter or Email

