Detecting Link Manipulation and Spam with Domain Authority

As a consumer of the API, I couldn’t do as much research as I wanted, because I had no access to the deeper metrics, measurements, and methods for finding anomalies in backlink profiles. We used such methods to detect backlink manipulation with tools like Remove’em and Penguin Risk, but they always fell short because of the limits of publicly available APIs.

They also didn’t scale. It’s one thing to collect all the backlinks for a site, even a big one, and judge each one by source type, quality, anchor text, and so on. Dozens of companies offer reports like this, provided you don’t mind waiting a few hours for the report to finish. But how do you do this every day for 30 trillion links?

Since Link Explorer came out and I moved to Moz, I’ve had access to far less filtered data. This has given me a clearer and deeper view of the tools that backlink index maintainers have for finding and stopping manipulation. I don’t want to suggest that all manipulation can be detected, but I do want to show you some of the many surprising ways to find spam.

The general methodology of detecting link manipulation and spam

You don’t have to be a mathematician or a data scientist to understand this simple way of finding link spam. There is plenty of math involved in measuring, testing, and building practical models, but the main idea is quite clear.

The first step is getting a solid random sample of links from the web, but let’s assume you’ve already done that. Next, for each attribute of those random links, such as DA or anchor text, you work out what is normal or expected. Finally, you look for outliers and see whether they correspond to something important, like sites that are manipulating the link graph or sites that are exceptional. Let’s begin with a simple example: link decay.
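Before getting to that example, here is a minimal Python sketch of the baseline-then-outliers idea. It is only an illustration, not the model Moz uses: the function name, the example domains, and the numbers are all hypothetical, and the modified z-score (median and median absolute deviation, with the usual 0.6745 constant and a 3.5 cutoff) is just one common rule of thumb for flagging outliers.

```python
from statistics import median

def robust_outliers(baseline, candidates, threshold=3.5):
    """Return the candidates whose value is far from the baseline median.

    baseline:   attribute values measured on a random sample (hypothetical data)
    candidates: mapping of site name -> the same attribute for that site
    """
    med = median(baseline)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(x - med) for x in baseline)
    if mad == 0:
        # No spread in the baseline: anything off the median counts as unusual.
        return {name: v for name, v in candidates.items() if v != med}
    flagged = {}
    for name, value in candidates.items():
        score = 0.6745 * (value - med) / mad  # modified z-score
        if abs(score) > threshold:
            flagged[name] = value
    return flagged

# Made-up attribute values (for instance, a per-site rate) for illustration only.
baseline_sample = [0.40, 0.42, 0.44, 0.45, 0.46, 0.48, 0.50]
sites = {"a.example": 0.44, "b.example": 0.0, "c.example": 1.0}
print(robust_outliers(baseline_sample, sites))
```

The median-based spread is used here because a few extreme sites in the sample would inflate an ordinary standard deviation and hide the very outliers we are looking for.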

Link decay and link spam

Link decay is the natural process of links dropping off the web or changing URLs. If you earn links after sending out a press release, you can expect some of them to disappear over time as the pages are archived or taken down for being outdated. If you get a link from a blog post, you might expect it to appear on the blog’s homepage until newer posts push that post to the second or third page.

But what if you paid for your links? What if you own a lot of domains and all of the sites link to each other? What if you run a PBN? These links tend not to decay. When you control your inbound links, you can often keep them from ever decaying. So we can form a simple hypothesis:

Hypothesis: The link decay rate of sites that manipulate the link graph will differ from that of sites with natural link profiles.

We test this hypothesis the same way we discussed earlier. We first find out what is normal. How fast do a random site’s links decay? We simply take a lot of sites and record how many links they have and how quickly those links are removed (we visit a page and see that a link is gone). We can then look for anomalies.

I’m going to keep this example of anomaly hunting really simple. No statistics or math, just a quick look at what turns up when we sort first by Lowest Decay Rate and then by Highest Domain Authority and see what is at the top of the list.

Success! It seems that every high-DA site with no link decay is powered by some kind of link network. This is the “Aha!” moment in data science that is so exciting. Notably, we find spam at both ends of the distribution. In other words, sites with 0% decay and sites with close to 100% decay are equally likely to be spammy.

The first kind is usually part of a link network, while the second kind tends to spam its backlinks onto sites that other spammers are also hitting, so its links are quickly pushed off to other pages.
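The sort described above can be sketched in a few lines of Python. This is a hedged illustration rather than the actual pipeline: the record layout (`links_seen`, `links_lost`, `da`), the function names, and every number are assumptions made up for the example.

```python
def decay_rate(site):
    """Share of a site's previously seen backlinks that are gone on recrawl."""
    if site["links_seen"] == 0:
        return 0.0
    return site["links_lost"] / site["links_seen"]

def rank_suspects(sites):
    """Sort by lowest decay rate first, then by highest Domain Authority."""
    return sorted(sites, key=lambda s: (decay_rate(s), -s["da"]))

def extremes(sites, low=0.01, high=0.99):
    """Sites at either end of the decay distribution (cutoffs are arbitrary)."""
    return [s["domain"] for s in sites
            if decay_rate(s) <= low or decay_rate(s) >= high]

# Hypothetical crawl results, for illustration only.
crawl = [
    {"domain": "a.example", "da": 20, "links_seen": 100, "links_lost": 30},
    {"domain": "b.example", "da": 60, "links_seen": 200, "links_lost": 0},
    {"domain": "c.example", "da": 45, "links_seen": 50, "links_lost": 0},
    {"domain": "d.example", "da": 10, "links_seen": 80, "links_lost": 80},
]

for site in rank_suspects(crawl):
    print(site["domain"], site["da"], round(decay_rate(site), 2))
print(extremes(crawl))
```

Sorting on a tuple keeps the two criteria explicit: ties on decay rate are broken by DA, so high-authority sites with no decay land at the top of the list, which is where the link networks showed up.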

Now comes the hard work of building a model that takes all of this into account and lowers Domain Authority in proportion to the severity of the link spam. But you may be wondering…

These sites don’t show up in Google, so why do they have good DAs?

This is a common problem with training sets. DA is trained on sites that rank well in Google so that we can predict who will rank above whom. But historically we haven’t (and, to my knowledge, no one else in our industry has) taken into account random URLs that don’t rank at all. We’re addressing this in the new DA model that will come out in early March. Stay tuned, since this is a big step forward in how we calculate DA!

Spam Score distribution and link spam

The use of Spam Score is one of the most interesting new features of Domain Authority 2.0. Moz’s Spam Score is a link-blind metric (it doesn’t use links at all) that predicts how likely a domain is to be indexed in Google. The higher the score, the worse the site.

We could just ignore links from sites with Spam Scores over 70 and be done with it. But it turns out that common link manipulation schemes leave behind interesting patterns. We can find them with the same simple method: take a random sample of URLs to see what a normal backlink profile looks like, then check whether the distribution of Spam Score among a site’s backlinks looks anomalous. Let me show you one.

It turns out that acting natural is quite hard. Even the best attempts often fall short, as this particularly nasty link spam network did. This network had haunted me for two years because it included a list of the top million sites. If you were one of those sites, you might see anywhere from 200 to 600 followed links show up in your backlink profile. I named the network “The Globe.”

It was easy to look at the network and see what they were up to, but could we detect it automatically so that we could devalue other networks like it in the future? When we looked at the link profiles of sites that were part of the network, the Spam Score distribution lit up like a Christmas tree.

Most websites get the majority of their backlinks from domains with low Spam Scores. As the Spam Score of the linking domains goes up, a site gets fewer and fewer backlinks from them. But this link network couldn’t hide, because Spam Score showed us that the sites in the network had quality problems.

We would never have found this problem if we had merely ignored the links with bad Spam Scores. Instead, we found a great classifier for identifying sites that are likely to be in trouble with Google for poor link building.
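Here is a rough Python sketch of comparing one site’s Spam Score distribution with the baseline from a random sample. To be clear about the assumptions: the ten-point buckets, the function names, the use of total variation distance as the comparison, and all of the scores are illustrative choices, not a description of the production classifier.

```python
from collections import Counter

def bucket_shares(scores, bucket_size=10, max_score=100):
    """Share of backlinks falling in each Spam Score bucket (0-9, 10-19, ...)."""
    n_buckets = max_score // bucket_size
    # A score of exactly max_score is folded into the last bucket.
    counts = Counter(min(int(s) // bucket_size, n_buckets - 1) for s in scores)
    total = len(scores)
    return [counts.get(b, 0) / total for b in range(n_buckets)]

def total_variation(p, q):
    """Distance between two distributions: 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical Spam Scores of linking domains, for illustration only.
baseline = [1] * 50 + [15] * 25 + [25] * 15 + [45] * 7 + [75] * 3
natural_site = [2] * 10 + [12] * 5 + [28] * 3 + [40] * 2
network_site = [3] * 5 + [65] * 10 + [72] * 5

base_shares = bucket_shares(baseline)
print(total_variation(bucket_shares(natural_site), base_shares))
print(total_variation(bucket_shares(network_site), base_shares))
```

A site whose distance from the baseline is large can then be queued for a closer look. Note that this uses the high Spam Score links as evidence instead of throwing them away.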

DA distribution and link spam

We can see similar patterns in how incoming Domain Authority is distributed among sites. A common way for companies to try to improve their rankings is to set minimum quality criteria for outreach programs, usually DA30 and above. Sadly, this has led to sites with manipulated link profiles that are quite easy to spot.

Let me be clear for a moment. Having a manipulated link profile is not always against Google’s guidelines. If you do targeted PR outreach, it’s reasonable to expect that this kind of distribution can arise without any attempt to manipulate the graph. But the real question is whether Google wants sites that do this kind of outreach to perform better. If not, Google can easily discount or even ignore this obvious case of link manipulation.

A normal link graph for a site that isn’t targeting high link equity domains will show most of its links coming from DA0–10 sites. There will be slightly fewer from DA10–20, and so on, until there are essentially no links from DA90+. This makes sense, since there are far more low-DA sites than high-DA sites on the web. But all of the sites above have strange link distributions, which makes it simple to detect them and correct link value at scale.

I want to make it clear that these are not always instances of breaking Google’s guidelines. But they are manipulations of the link graph. You have to decide whether you think Google takes the time to work out what kind of outreach produced the strange link distribution.
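The expected shape, with each DA bucket holding fewer links than the one below it, suggests a very simple check. The sketch below is only one possible way to express it: the function names, the 2% tolerance, and the sample profiles are hypothetical.

```python
def da_histogram(da_values, bucket_size=10):
    """Count inbound links per DA bucket (0-9, 10-19, ..., 90-100)."""
    n_buckets = 100 // bucket_size
    counts = [0] * n_buckets
    for da in da_values:
        # DA 100 is folded into the top bucket.
        counts[min(int(da) // bucket_size, n_buckets - 1)] += 1
    return counts

def rising_buckets(counts, bucket_size=10, tolerance=0.02):
    """Lower bounds of buckets whose share of links exceeds the bucket below.

    A natural profile should decline bucket by bucket, so any rise larger
    than `tolerance` (as a share of all links) is worth a closer look.
    """
    total = sum(counts)
    if total == 0:
        return []
    shares = [c / total for c in counts]
    return [i * bucket_size for i in range(1, len(shares))
            if shares[i] - shares[i - 1] > tolerance]

# Hypothetical DA values of linking domains, for illustration only.
natural = [5] * 50 + [15] * 25 + [25] * 12 + [35] * 7 + [45] * 4 + [55] * 2
outreach = [5] * 5 + [15] * 3 + [25] * 2 + [35] * 40 + [45] * 30 + [55] * 20

print(rising_buckets(da_histogram(natural)))
print(rising_buckets(da_histogram(outreach)))
```

In the made-up outreach profile, the jump appears exactly at the DA30 bucket, which is the kind of footprint a minimum-DA outreach policy would leave.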

What doesn’t work

For every link manipulation detection method we find, we throw out dozens. Some of these are quite surprising. Let me describe just one.

The first surprise was the ratio of nofollow to follow links. It seems obvious that spammers who leave comments, post in forums, and so on will end up with a lot of nofollowed links, leaving an easy pattern to spot. Well, it turns out that this isn’t true at all.

The ratio of nofollow to follow links isn’t a good way to tell whether a site is spammy, since popular sites like facebook.com often have a higher ratio than even pure comment spammers. This is probably because people use widgets and beacons, and leave genuine comments that link to popular sites like facebook.com. This isn’t always the case, however.

Some sites have a lot of root linking domains and 100% nofollow links. It’s easy to spot certain anomalies, like “Comment Spammer 1,” but in general the ratio isn’t a useful way to tell spam from ham.

So what’s next?

Moz is continually traversing the link graph looking for ways to improve Domain Authority, using everything from basic linear algebra to complex neural networks. We have a clear goal: to create the best Domain Authority metric ever. We want a metric that people can trust over the long term to root out spam, just as Google does, and that also helps you see when you or your competitors are pushing the limits, all while maintaining or improving correlations with rankings.

Of course, we don’t expect to get rid of all spam; no one can do that. But we can do better. With the help of the amazing Neil Martinsen-Burrell, our metric will stand alone in the industry as the canonical way to measure how likely a site is to rank in Google.