Battle of the Bots

And not at all in a cool way, unfortunately. More in a "that's kind of annoying..." way. For the most part, bots roam the interwebs hourly, daily, yearly without anyone noticing, and they will continue to do so until the end of machine time without anyone's consent or knowledge. Think of SkyNet, but without any really useful powers. Some are good and some are bad. Some are smart, and some just eat until they fill up their logs. Much like humans do.

But they're a problem, and depending on how you see things, it could be kind of a big problem. As an example, just recently I began looking at the internal logs for several of my websites, and I was somewhat curious about what I saw. The numbers just didn't quite add up. So I did a little digging and began to filter out some "bad" data. A lot of bad data. And what I found was interesting to me, and I hope to you too. So I'll be going over that today.

I for one, welcome our robot overlords

There are a lot of bots

Just checking my logs for a few months yielded about 100 bot-like matches. All different, and all hungry. And all kind of inflating my numbers a bit. And while my numbers aren't massive by any means, the inflation still influences how I work. For example, on a few of my websites, pages that I assume have higher traffic will normally get more frequent content updates and a little more TLC. So it's important to know which pages are popular, and for that I rely on some pretty basic tracking. Essentially, you view the page, it's a +1 for the page. And for a few minutes any subsequent views won't be counted. Pretty simple, right? I thought so too.
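
To make that tracking rule concrete, here is a rough Python sketch of a "+1 per view, then ignore repeats for a few minutes" counter. It is not the code running on my sites. The class name, the visitor identifier and the 5 minute window are all assumptions made up for illustration.

```python
import time
from collections import defaultdict

# Assumed window: the post only says "a few minutes".
COOLDOWN_SECONDS = 5 * 60


class ViewCounter:
    """Hypothetical in-memory counter: one view per (page, visitor) per cooldown window."""

    def __init__(self, cooldown=COOLDOWN_SECONDS):
        self.cooldown = cooldown
        self.counts = defaultdict(int)  # page -> total counted views
        self.last_counted = {}          # (page, visitor) -> time of the last counted view

    def record(self, page, visitor_id, now=None):
        """Return True if the view was counted, False if it fell inside the cooldown."""
        now = time.time() if now is None else now
        key = (page, visitor_id)
        last = self.last_counted.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # repeat view from the same visitor, too soon to count
        self.last_counted[key] = now
        self.counts[page] += 1
        return True
```

Notice that nothing in there asks who is viewing the page. A bot that drops by every ten minutes gets counted every single time, which is exactly how the numbers get inflated.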

Battle of the Bots

But maybe I should have paid just a bit more attention to that data. Because as it turns out, a very, very large percentage is probably not human. And after purging said data, things start to look way different than I had first pictured them.

It's not just my problem

Many large sites have gone out of their way to delete, ban, truncate and flag any "unusual" users on their websites. YouTube, for example, has purged accounts and views for years now in an attempt to combat the problem. More than likely, many of those are bots. Possibly not all. But it did have an effect. For one, view counts went down. And in a world where view counts represent one's level of success, and shape how many people value others, this could be influential. And if you're an actual corporation it's even worse, because those numbers can carry financial weight and thus introduce some legal problems into the mix.

The same goes for online polls, shares and clicks. We can't really know what's accurate anymore. And just how bad the problem is stays internal to each and every company. But I'm no company. So I can safely say that a good 40-50% of my "views" are from bots. Many, many bots, from many, many parts of the world. Some well hidden, and some not so much. And on some sites it's even worse, as you saw in the numbers above.

Google Analytics does a fine job of filtering out most bots, probably. Which is great for tracking sitewide stats. But what about those view and share counts adorning most sites? That's a different story. Those are custom in-house numbers acquired in many different ways. Some sites do in fact use third-party tracking tools, which may or may not be more accurate. It really is difficult to say for sure.

What is a bot

So let's start here. What is a bot? Well, in its simplest form, a bot is a program that makes HTTP requests of some kind. Bots can do many different things, from reading a page, to attempting to log in through a form, to simply showing up in large numbers in what many would call a DDoS attack. But they're not all bad.

The good guys

Many bots are designed to read your page content, parse it, analyze it and serve it up to users in search engines. Awesome job, little dude! Many of the big sites out there have bots, and we probably wouldn't have a web without them. They're a big part of the reason that Google search is so accurate.

The bad ones

Then on the other end of that spectrum, and probably in large quantities, you have your more notorious bots. They'll try any of the following:

  • Steal content
  • Submit forms
  • Get into your Analytics
  • Take up resources

And those are just the things that I've seen. I'm sure there are some pretty complex bots out there looking to take advantage of some loophole online.

Battle bots

So how do you battle these bots? Well, there are a few ways to try to mitigate the problem. And while it's impossible to remove all traces of bot traffic, you can trim it down a bit. For one, they usually leave data behind, such as their browser and where they came from.

You can take a look at the User-Agent or the Referer (referrer) header of each request and look for names pertaining to bots and crawlers. A few examples are:

  • bot
  • spider
  • crawler
  • scraper
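
As a rough illustration of that kind of check, here is a minimal Python sketch that flags a request when any of the keywords above shows up in its User-Agent or referrer. The function name is made up, and treating an empty User-Agent as a bot is my own assumption, not a rule.

```python
import re

# Keywords from the list above; extend with whatever turns up in your own logs.
BOT_KEYWORDS = ("bot", "spider", "crawler", "scraper")
_BOT_PATTERN = re.compile("|".join(map(re.escape, BOT_KEYWORDS)), re.IGNORECASE)


def looks_like_bot(user_agent, referrer=""):
    """Hypothetical helper: True if the request headers look bot-like."""
    # Assumption: a missing or empty User-Agent is treated as suspicious.
    if not user_agent:
        return True
    # Case-insensitive substring match against both header values.
    return bool(
        _BOT_PATTERN.search(user_agent) or _BOT_PATTERN.search(referrer or "")
    )


if __name__ == "__main__":
    print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
    print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"))
```

Keep in mind this only catches bots that announce themselves. Anything spoofing a regular browser User-Agent sails right through, and a legitimate user agent that happens to contain "bot" gets flagged by mistake.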

But each site is different. And bot traffic definitely depends on the type of content that you're serving up. For example, this blog has a much lower percentage of bots than some of my other sites, possibly due to the lower amount of content. I can only write so much.

Why is it important?

"Maybe he's overreacting," you're saying to yourself. Maybe these bots are harmless and just want that sweet, sweet data. Maybe you're right. And some bots we do want on our sites. The fact that YouTube went so long without purging bot data is a good indication of just how low on the priority list this sits. It's one of those things where, sure, you can ignore it for a bit and just enjoy the inflated traffic numbers. But eventually, it's going to catch up.

For one, most companies don't have unlimited storage space and bandwidth. And the bigger they are, the more bots will have an impact. Just imagine that 60% of your bandwidth was dedicated to serving up pages to bots. On a small website that's no big deal. But on a website with millions of monthly users, you begin to see where I'm going with this.

the best solution is to ignore

Out of sight, out of mind

What I learned after years of filtering and blocking and redirecting was that it's a never-ending battle. As long as you ignore those pesky bots, you'll be just fine. That whole out of sight, out of mind thing actually works great in this situation. I had a backwards approach to the whole thing, in which I tried to battle the bots. For years. I blacklisted them, I redirected them, I served them pages with just 1 pixel. It was a war of attrition.

Now I don't track them. I don't pay any attention to them. They have no say in my view counts, and life is better. Or quieter anyway. As time goes on, however, maybe bots will become a bigger problem, and maybe one day we won't be able to tell the difference between the blog post with 44 views and the one with 350,523.

Walter Guevara is a Computer Scientist, software engineer, startup founder and currently mentors for a coding bootcamp. He has been creating software for the past 15 years.

Tags

Security