Reliam's most important asset is our talented team of Internet Application Engineers. Every month exceptional individuals from our team are featured. They discuss issues or solutions that are dear to the featured geek's heart at the present moment. These issues just might be dear to your heart too.
Taking Your Script Multithreaded
By Brandon Burton, Internet Application Engineer
November 22, 2010 - As part of a recent migration and upgrade, we needed to move approx. 160GB of data between two systems. At first glance, this seems like a relatively easy proposition. Just use robocopy or rsync or the like, and that is where we started.
At first we tried robocopy, since both systems were running Windows Server. After robocopy had run for 12 hours, it was apparent that even though robocopy claims to be able to do differential transfers, it either doesn’t or is very inefficient about it.
After that, we looked at the various rsync-on-Windows options and decided to try DeltaCopy. We installed, set up, and ran DeltaCopy, which got our sync time down to 4-6 hours, depending on when we ran it, as load is highest during business hours.
This still seemed too high, so we investigated why. As we evaluated the situation, it became apparent that it was more complicated, for the following reasons:
Solution
So given all that, our thought was: how can we parallelize the workload? We considered the situation and decided to write a multithreaded script for the syncing. As I was tasked with this and I am a Python guy at heart, I chose to implement it in Python. This proved to be relatively easy to do, as Python ships threading and Queue modules in its standard library.
In summary, this script reads in a list of the directories in the source directory, then spawns off N number of parallel threads (N is controlled by a variable). It will skip certain directories, as defined in an exclude list. It continues to run threads until the queue is empty and all the sync threads have finished. Along the way, it logs some output to log files and outputs terser logging to the console.
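As a rough sketch of the pattern just summarized, here is a condensed version written for Python 3 (the script itself is Python 2). The names run_pool and handle are made up for illustration; handle stands in for whatever does the real sync of one directory.

```python
import queue
import threading


def run_pool(items, handle, num_threads=4, exclude=()):
    """Run handle(item) across num_threads workers; return the handled items, sorted."""
    work = queue.Queue()
    done = []
    done_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()            # blocks until an item is available
            try:
                if item not in exclude:
                    handle(item)
                    with done_lock:
                        done.append(item)
            finally:
                work.task_done()         # always mark the item, even on error

    # Daemon threads are abandoned when the main thread exits.
    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
    for item in items:
        work.put(item)
    work.join()                          # returns once every item is marked done
    return sorted(done)
```

Calling run_pool(["a", "b", "skip"], handle=some_function, exclude={"skip"}) would hand "a" and "b" to the workers and pass over "skip".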
What follows is a walk through of how I implemented this multithreaded sync script in Python.
Step one is to import what we’ll need from the standard library.
These are the modules for the threads and the queue we’ll use to sync each directory in parallel:
from threading import Thread
from Queue import Queue
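One note for anyone following along on Python 3: the Queue module was renamed to queue there. A small sketch of an import that should work on either version:

```python
from threading import Thread

try:
    from queue import Queue    # Python 3 module name
except ImportError:
    from Queue import Queue    # Python 2 module name, as used in this script
```

The rest of the Queue API used in this walkthrough (put, get, task_done, join, qsize) is the same under both names.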
Then we need some modules for interacting with the system, logging, and spawning our sync process:
import subprocess, os, sys
import time
import datetime
Next we define some variables and objects we’ll need for controlling the number of threads, the queue, and the syncing process. Note that since the rsync we used is actually a tiny bundle of Cygwin, the source and destination paths are Unix style.
num_threads = 75
queue = Queue()
rsync = "\"C:\\Program Files\\DeltaCopy\\rsync.exe\" -v -rlt -z --chmod=a=rw,Da+x --delete"
source = "/path/to/source"
destination = "/path/to/destination"
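If you start from Windows paths, they need converting to the Unix style mentioned above. The sketch below assumes Cygwin's default convention of exposing drive letters under /cygdrive; the DeltaCopy bundle may be configured differently, so treat the prefix as an assumption. The helper name to_cygwin_path is hypothetical and not part of the script.

```python
import re


def to_cygwin_path(windows_path):
    """Convert 'D:\\data\\sites' to '/cygdrive/d/data/sites' (hypothetical helper)."""
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", windows_path)
    if not match:
        # No drive letter: only normalise the separators.
        return windows_path.replace("\\", "/")
    drive, rest = match.groups()
    rest = rest.replace("\\", "/")
    # Assumes the default /cygdrive prefix.
    return "/cygdrive/" + drive.lower() + ("/" + rest if rest else "")
```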
Next we define some bits for logging. As this script ran on Windows, my log file paths are Windows style.
now = datetime.datetime.now()
logfiledate = now.strftime("%Y-%m-%d %H-%M-%S")
logfile = "D:\\logs\\threads_log" + logfiledate + "_log.txt"
As there were a number of files in the root of the source directory, I wrote a small function that returns only the directories in the source, since those are all I want to sync.
def listdirs(folder):
    return [d for d in os.listdir(folder) if os.path.isdir(os.path.join(folder, d))]
I call this function on the source to get my list. I also declare a list of folders to exclude, as there are certain folders in the root of the source I don’t want to sync.
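To see what listdirs returns, here is a small self-contained check in Python 3. The directory and file names (site_a, site_b, readme.txt) are invented for the example, and demo_listdirs is not part of the script.

```python
import os
import tempfile


def listdirs(folder):
    """Return the names of the immediate subdirectories of folder."""
    return [d for d in os.listdir(folder)
            if os.path.isdir(os.path.join(folder, d))]


def demo_listdirs():
    """Build a throwaway tree and list it (names here are illustrative)."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "site_a"))
        os.mkdir(os.path.join(root, "site_b"))
        with open(os.path.join(root, "readme.txt"), "w") as handle:
            handle.write("a file in the root, which should be skipped\n")
        # os.listdir makes no ordering promise, so sort for a stable result.
        return sorted(listdirs(root))
```

The file in the root is left out of the result, which is the behavior the sync script relies on.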
directories = listdirs(source)
exclude = []
Even though I’ve yet to declare my main function that will do most of the work, I print and log output that indicates things are starting
print "Beginning"
print datetime.datetime.now()
file = open(logfile, 'w')
file.write("Beginning\n")
file.write(str(datetime.datetime.now()) + "\n\n\n")
file.close()
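The open, write, close sequence above is repeated throughout the script, and later several threads append to the same file. One possible way to tidy that up, sketched in Python 3, is a small helper guarded by a lock. The names log_line and demo_log are hypothetical and not part of the script.

```python
import datetime
import os
import tempfile
import threading

_log_lock = threading.Lock()


def log_line(path, message):
    """Append one timestamped line to path; the lock keeps threads from interleaving."""
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with _log_lock:
        with open(path, "a") as handle:
            handle.write(stamp + " " + message + "\n")


def demo_log(workers=5):
    """Write one line per thread to a temporary file and return the lines written."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sync_log.txt")
        threads = [threading.Thread(target=log_line, args=(path, "worker %d" % n))
                   for n in range(workers)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        with open(path) as handle:
            return handle.read().splitlines()
```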
While developing I was trying different things, so I put this pause in. I found it useful, so I just left it in.
print "pausing in case of debugging"
time.sleep(30)
This function is where we do most of the work. I’ve left in some debugging lines, commented out by default; for a script this short, I find it easier to comment them out than to add conditional logic based on a verbose argument or the like.
def syncDirectory(i, queue):
    """syncsDirectory"""
    #print "running thread",
When the function is called, a queue is passed to it. The function then loops, taking a directory name from the queue and working on it. queue.get() blocks while the queue is empty, so a worker simply waits until there is something to do.
    while True:
        dir = queue.get()
If the directory from the queue is on our exclude list, we’ll skip it by marking it as done on the queue
        if dir in exclude:
            print "excluding", dir
            queue.task_done()
If it’s not in our exclude list, then we start by logging that we’re beginning to sync that particular directory and proceed to sync it. One thing to note is that this script writes both a summary to one log file and a separate log file of the rsync output for each directory.
What and how much you log is a tricky balance. I try to err on the side of too much, as there are always issues that need to be debugged and edge cases you didn’t think of, like weird characters in file/directory names. Finally, I sleep for one second to try to keep the console output in order: many of the directory syncs finished in less than a second, and since the script is threaded, the order in which things get printed to the console is non-deterministic.
        else:
            file = open(logfile, 'a')
            file.write(" Beginning"+ dir +"\n")
            file.write(" " + str(datetime.datetime.now()) + "\n")
            file.close()
            print "syncing", dir
            #print " " + source + "\\" + dir
            #print " " + destination + "\\" + dir
            src = source + "/" + dir + "/"
            dest = destination + "/" + dir + "/"
            command = rsync + " " + src + " " + dest
            print command
            now = datetime.datetime.now()
            logfiledate = now.strftime("%Y-%m-%d %H-%M-%S")
            ret = subprocess.call(command,
                                  shell=True,
                                  stdout=open('D:\\'+ dir + logfiledate + '-log.txt', 'a'),
                                  stderr=subprocess.STDOUT)
            #print "running dir on", dir
            queue.task_done()
            file = open(logfile, 'a')
            file.write(" Finishing"+ dir +"\n")
            file.write(" " + str(datetime.datetime.now()) + "\n")
            file.close()
            print "done syncing", dir
            time.sleep(1)
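The command above is built as one string and run through the shell, so a directory name containing spaces or other odd characters may get split or misread. One possible alternative, sketched here in Python 3, is to pass subprocess a list with one element per argument and let it handle the quoting. The helper names are hypothetical, and whether rsync's Cygwin layer then sees every name intact is something I'd test rather than assume.

```python
import os
import subprocess
import sys
import tempfile


def build_rsync_command(rsync_exe, options, source, destination, name):
    """Return the rsync invocation as a list, one element per argument."""
    src = source + "/" + name + "/"
    dest = destination + "/" + name + "/"
    return [rsync_exe] + list(options) + [src, dest]


def run_logged(command, log_path):
    """Run command, appending stdout and stderr to log_path; return the exit code."""
    with open(log_path, "a") as log:
        return subprocess.call(command, stdout=log, stderr=subprocess.STDOUT)


def demo_run():
    """Illustrative only: runs the current Python interpreter in place of rsync."""
    with tempfile.TemporaryDirectory() as folder:
        log_path = os.path.join(folder, "demo-log.txt")
        code = run_logged([sys.executable, "-c", "print('synced')"], log_path)
        with open(log_path) as log:
            return code, log.read().strip()
```

Using a with block for the per-directory log also closes the file handle once rsync has finished.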
Then we do some output to give an indicator of how many tasks in the queue remain
        queuesize = queue.qsize()
        file = open(logfile, 'a')
        file.write(" " + str(queuesize) + " remaining \n")
        file.close()
        print queuesize, "remaining"
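The function stores rsync's exit status in ret but never looks at it. A minimal sketch of one way to keep track of failures across threads, so they can be reported at the end of the run, might look like this. FailureLog is a hypothetical helper, not part of the script.

```python
import threading


class FailureLog(object):
    """Thread-safe record of directories whose sync exited non-zero (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._failures = {}

    def record(self, name, return_code):
        """Remember name only if its return code signals a failure."""
        if return_code != 0:
            with self._lock:
                self._failures[name] = return_code

    def summary(self):
        """Return a copy of the failures seen so far as {directory: return_code}."""
        with self._lock:
            return dict(self._failures)
```

A worker would call failures.record(dir, ret) right after subprocess.call, and the main thread would write failures.summary() to the log once queue.join() returns.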
This preps our threads to run, as soon as we populate the queue
for i in range(num_threads):
    worker = Thread(target=syncDirectory, args=(i, queue))
    worker.setDaemon(True)
    worker.start()
Next, we populate the queue
for dir in directories:
    queue.put(dir)
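Then we wait for the workers to finish: queue.join() blocks the main thread until every queued directory has been marked done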
print "Main Thread Waiting"
queue.join()
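The script relies on daemon threads, which are simply abandoned when the main thread exits. A common alternative, which is not what the script does, is to put one sentinel value per worker on the queue so each thread can exit cleanly and be joined. A Python 3 sketch, with made-up names:

```python
import queue
import threading


def drain_with_sentinels(items, handle, num_threads=4):
    """Process items with worker threads that stop when they read a None sentinel."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:             # sentinel: no more work for this thread
                break
            value = handle(item)
            with lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for item in items:
        work.put(item)
    for _ in threads:
        work.put(None)                   # one sentinel per worker
    for thread in threads:
        thread.join()                    # wait for every worker to exit
    return sorted(results)
```

This only works if None can never be a real work item, which holds for directory names.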
Finally, we finish and log the finish time
print "Done"
print datetime.datetime.now()
file = open(logfile, 'a')
file.write("Done\n")
file.write(str(datetime.datetime.now()) + "\n")
file.close()
So that is the background and a walkthrough of how we built a multithreaded sync script. Test runs showed that we could run up to 75 parallel threads, and taking the script multithreaded shrank the sync window from 4 hours to 45 minutes, which in turn shrank the cutover window to an hour.
Obviously this is my first attempt at a multithreaded script, and I learned a lot along the way. Some of the things I’d like to improve and invite feedback on are
I wanted to point out that all of this is running on Windows, and Python is proving to be a great scripting tool for doing work on Windows.
Finally, I’ve put up the full script as a GitHub repo. My goal is to rewrite the script as I get feedback on improvements; this will serve as an educational exercise for myself and anyone who follows along. If I get a lot of feedback, I’ll do a follow-up post in 1-2 months on what I’ve learned along the way.
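To give a feel for how test runs with different thread counts can be compared, here is a small Python 3 sketch. It uses concurrent.futures.ThreadPoolExecutor from the standard library rather than the Thread and Queue code above, and fake_sync just sleeps in place of a real rsync run, so the numbers it produces say nothing about real transfer times.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_sync(name):
    """Stand-in for one directory sync: waits the way an I/O-bound job would."""
    time.sleep(0.02)
    return name


def time_run(task, items, num_threads):
    """Return the seconds taken to run task over items with num_threads workers."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(task, items))      # list() forces every task to finish
    return time.perf_counter() - started


def compare(thread_counts, items):
    """Map each thread count to the elapsed time of one run."""
    return {count: time_run(fake_sync, items, count) for count in thread_counts}
```

Running compare([1, 8], list(range(16))) should show the 8-thread run finishing well ahead of the single-thread run, since the fake jobs spend their time waiting rather than computing.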
Please provide your feedback via Twitter or Email
The Stages of Caching
By Christopher Evans, Internet Application Engineer
July 26, 2010 - We here at Reliam are in the business of scaling web applications. What does this mean? We work closely with developers on a constant basis to optimize websites and improve page load times. One method of reducing the load time of assets and increasing overall website speed is caching, especially with a CDN. User-provided images are the easy part. Next to address is caching of JavaScript and CSS... Finally, caching of the HTML, including support for dynamic portions. Read more

Tuning LAMP Stack
By Chris Harshman, Internet Application Engineer
January 28, 2010 - One of the most critical aspects of Linux, Apache, MySQL, PHP[1] application management is tuning the software components that make up this "stack" of software. Without proper benchmarking and tuning, your application could slow to a crawl under load, being bogged down by excessive swapping or lengthy database queries -- or worse, it could fail completely. Read more
Automation is the Cloud
By Brandon Burton, Internet Application Engineer
October 6, 2009 - There has been a lot of buzz, press coverage, and product offerings over the last year that all have something about "The Cloud" in them. I think it is important that as technologists and sys admins, we do what we can to bring clarity to what "The Cloud" is and how it affects and benefits you. This is my take on what "The Cloud" is and my attempt to bring some clarity to the discussion. Read more

Using Bash to Glue UNIX
By Nathan Rich, Internet Application Engineer
August 2, 2009 - Much like Perl is the glue language for programmers, BASH scripting is the glue language for UNIX and Linux operating systems themselves. When I was learning Linux concepts I started by mastering BASH. I purchased the book “Learning the BASH Shell” and went to work studying and applying the information within. I love working with BASH because the more you learn it, the more you learn about UNIX and Linux's GNU tools. Thus by simply studying and applying one relatively easy language, you learn all of the common elements of interacting with UNIX and Linux. Read more