I work for a manufacturing company that sells to distributors and occasionally direct to consumers. We deal with a lot of mom-and-pop stores, and many are completely off the grid, with no web presence at all. Unfortunately for us, those companies are the most likely to try and buy our product. We search all over the internet, forums, tradeshows, everywhere, and find new stores to sell to all the time, but we know there are many more we are missing.
Whenever we sign up a new dealer, we place their contact details in our dealer locator, right on our website, so the public can find our product. Light bulb! Our competitors also do this. We seem to have a never-ending supply of competitors, and they all have a few thousand dealers listed on their sites. How do I harvest all of this data and get it in front of our salesmen without tipping off the competitor? Do I even care if the competitor knows I was mining their public list?
Rather than spend a good amount of time writing something complicated to scrape every site, I found that most of these dealer locator scripts function similarly. Some are more than generous and store a nice XML file on their server. Others take a bit more automation and require you to search by zip. The good ones require you to search by zip and include latitude and longitude in the results. One thing they all have in common is a search radius. This allows a would-be customer to search for a store within, say, 20 miles of their house.
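For the generous sites that just hand you an XML file, the parsing is about as easy as it gets. Here is a minimal sketch in Python. The element and attribute names (dealer, name, zip, lat, lng, phone) and the sample dealers are made up for illustration, since every locator script uses its own layout. It also drops duplicates, which helps later when overlapping radius searches return the same store twice.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real locator files will use different tag/attribute names.
SAMPLE_XML = """<dealers>
  <dealer name="Bob's Bait" zip="59718" lat="45.67" lng="-111.04" phone="555-0100"/>
  <dealer name="Lakeside Outfitters" zip="83814" lat="47.68" lng="-116.78" phone="555-0101"/>
  <dealer name="Bob's Bait" zip="59718" lat="45.67" lng="-111.04" phone="555-0100"/>
</dealers>"""


def parse_dealers(xml_text):
    """Return one dict per unique dealer, keyed on (lowercased name, zip)."""
    root = ET.fromstring(xml_text)
    dealers = {}
    for node in root.iter("dealer"):
        key = (node.get("name", "").strip().lower(), node.get("zip", "").strip())
        # setdefault keeps the first copy and ignores later duplicates
        dealers.setdefault(key, dict(node.attrib))
    return list(dealers.values())


dealers = parse_dealers(SAMPLE_XML)
for d in dealers:
    print(d["name"], d["zip"], d["phone"])
```

Keying on name plus zip is a judgment call; a phone number or the latitude/longitude pair would work as well if the site provides them.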
This is where things got good. Running every zip code (single-threaded, so as not to trigger a DoS or rate limit) could take as many as 6 days per website. Funny enough, that would have been fine, since the salespeople can’t call people as fast as I can pull the lists, even at that slow request rate. Either way, I knew I could do better. I found that most of the search systems let you select a radius, and that sometimes you can tamper with the GET/POST data to allow a 99999 mile radius. Those were fun and never took long. Others would limit you to 100 or 200 miles. So why would I want to check EVERY zip code when I have a 200 mile radius? With a “bit” of code and some help, I was able to generate an optimized list of zip codes that covers the United States without much overlap. My list of zip codes to check went from 42,522 down to 100, given a 200 mile radius. I can now run through the list in a matter of minutes and scrape all of the data efficiently.
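To give an idea of how a reduced zip list can be built, here is a rough sketch in Python. It is not the exact code I used, just one simple way to do it: a greedy cover that keeps picking the zip whose search circle reaches the most zips not yet covered, using great-circle (haversine) distance between zip centroids. The function names and the six centroids are illustrative assumptions with approximate coordinates; a real run would load a full zip-to-latitude/longitude table.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def pick_search_zips(centroids, radius_miles):
    """Greedy cover: choose zips until every centroid is within radius of one."""
    uncovered = set(centroids)
    chosen = []
    while uncovered:
        best_zip, best_covered = None, set()
        for z in sorted(centroids):  # sorted so ties break the same way every run
            covered = {u for u in uncovered
                       if haversine_miles(centroids[z], centroids[u]) <= radius_miles}
            if len(covered) > len(best_covered):
                best_zip, best_covered = z, covered
        chosen.append(best_zip)
        uncovered -= best_covered
    return chosen


# Approximate centroids, for illustration only.
centroids = {
    "10001": (40.75, -73.99),   # New York
    "19104": (39.96, -75.20),   # Philadelphia
    "02108": (42.36, -71.06),   # Boston
    "60601": (41.89, -87.62),   # Chicago
    "53202": (43.04, -87.90),   # Milwaukee
    "90012": (34.06, -118.24),  # Los Angeles
}

chosen = pick_search_zips(centroids, radius_miles=200)
print(chosen)
```

This version compares every zip against every other zip on each pass, which is fine for a toy list but slow for 42,522 of them, so a real run would want a spatial index or a coarser grid of candidates. It also only guarantees that zip centroids are covered, so shaving some miles off the radius leaves a margin for stores sitting at the edge of a large zip code.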
Next time you need a public dealer locator list dumped so your sales guys can start making some money, just shoot me an email (chad (a_t) outkastz (d.o.t) com). I’ve gotten pretty good at it.