Intro
Crawl Your Site and Prepare the Data
Pages Crawled Vs. Pages That Actually Exist
Inbound Links Pointing to the Wrong Place
Parts of a Page Aren’t Being Crawled
Pages Not Present In the Crawl
Your Organization Does Not Match Up with Users
Intro
According to SEOGadget’s Richard Baxter, site architecture is “how users and search engines find their way around your site.” Good site architecture means that your site is easy for search engines and users to navigate. Bad site architecture can cause lower indexation, user frustration, and lower rankings overall.
Site architecture should be a reflection of how users search for your products. However, because websites are constantly changing and growing, site architecture often gets overlooked and neglected. This results in search engines not being able to access your entire site, users not finding what they need, and several other problems that will cost you traffic and, at the end of the day, money.
Crawl Your Site and Prepare the Data
The first thing you’ll need to do is crawl your site with some sort of crawling program like Screaming Frog.

Once the crawl is complete, export the data to Microsoft Excel. You will then have a nifty little spreadsheet with the following information: the URLs of all your pages, the inbound links to each page, the outbound links from each page, and an indicator of page depth (Screaming Frog includes that in its crawl; if your crawler doesn’t, don’t worry). The next step is preparing the results of this crawl so we can draw actionable information out of it. In its current state you could work out most problems, but it would be time consuming.
Because you want this to be as quick and efficient as possible, you are going to need to make one change to the spreadsheet. Insert a number of columns to the right of the URL column equal to the number of subdirectories, folders, and pages in the URL of your deepest page. So for a URL that reads mysite.com/category1/subcategory1/subsubcategory1/content, you would need to insert 4 columns to the right of the URL. Make sure each of these has a name in the topmost row, where all of the other labels are.

Then highlight the entire URL column, find the Text to Columns button, and split that column at the ‘/’. Now all of the columns to the right of the URL should be filled with the directories and subdirectories of your website.
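If you would rather script this step than use the spreadsheet, the same split can be sketched in Python. This is only an illustration of the idea: the function names split_url and pad_segments are made up for this example, and the sample URLs are the placeholder mysite.com addresses from above.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return the path segments of a URL, ignoring empty pieces."""
    # urlsplit only recognises the host when a scheme is present,
    # so add one for bare addresses like "mysite.com/category1".
    if "://" not in url:
        url = "http://" + url
    path = urlsplit(url).path
    return [segment for segment in path.split("/") if segment]

def pad_segments(urls):
    """Split every URL and pad each row to the width of the deepest page."""
    rows = [split_url(u) for u in urls]
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]

urls = [
    "mysite.com/category1/subcategory1/subsubcategory1/content",
    "mysite.com/category2/content",
]
for row in pad_segments(urls):
    print(row)
```

Each row ends up with the same number of columns, which mirrors inserting enough empty columns for your deepest page before splitting.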

Now we are almost ready to start auditing. The last step is to put all of that data in a pivot table.
What to look for:
Now that you have the data in a form that can actually be useful to you, you need to look at the following points in the pivot table.
Inbound Links
Screaming Frog won’t tell you about external links pointing to your pages, but it does tell you about the inbound links to a page from your own site. Internal linking is a good indicator to the search engines of what you think is important content on your website. Each link is a vote towards a piece of content. So drop all of the URL headings (category, subcategory, and so on) in the ‘Row Label’ section and drop ‘Inlinks’ in the ‘Values’ section; be sure to change the value from count to average. Now you have the average number of inbound links per page in each category, subcategory, and so on.
Outbound Links
External outbound links are a great way to add validity and value to the content on your website. There is even some evidence to suggest that Google may take into account the quality of the pages you are linking to when calculating rankings. However, too many links can waste a search engine’s and a user’s time without really adding any value. Checking outbound links is also a great way to make sure all of the page is being crawled accurately. Drop all of the URL headings (category, subcategory, and so on) in the ‘Row Label’ section and drop ‘External Outbound links’ in the ‘Values’ section; be sure to change the value from count to average. Now you have the average number of external outbound links per page in each category, subcategory, and so on.
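For readers who prefer a script to a pivot table, the averaging step could look roughly like the sketch below. The rows, the field names (category, url, inlinks), and the helper average_by are all assumptions made for this example; your crawler's export will use its own column names.

```python
from collections import defaultdict

# Hypothetical crawl rows; real exports will have different columns and values.
rows = [
    {"category": "category1", "url": "/category1/a", "inlinks": 12},
    {"category": "category1", "url": "/category1/b", "inlinks": 8},
    {"category": "category2", "url": "/category2/c", "inlinks": 3},
]

def average_by(rows, key, value):
    """Average the `value` field for each distinct `key`, like a pivot table set to Average."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running sum, row count]
    for row in rows:
        bucket = totals[row[key]]
        bucket[0] += row[value]
        bucket[1] += 1
    return {k: total / count for k, (total, count) in totals.items()}

print(average_by(rows, "category", "inlinks"))
```

The same helper works for any numeric column, so it could be pointed at an outbound link count instead of inlinks.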
Page Depth
Page depth is the level at which a page was found. So if you have your homepage and then a category page, all of the content linked from that category page would be on the second level of your site. Luckily, Screaming Frog records the depth of each page. However, if you use a crawler that doesn’t, simply check the number of folders that come before the content. That is a good measure of depth. Screaming Frog sets the top level of your site (your homepage) as 0, so keep that in mind. Here we are only going to take the column that has the content page in it, and put that in the ‘Row Label’ section. Then we will take the depth column and put that in the ‘Values’ section. This will tell you how deep your pages are.
Pages Crawled vs. Pages That Actually Exist
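If your crawler does not report depth, the folder-counting estimate described above can be sketched as follows. The function name folder_count is invented for this example, and it assumes the last path segment is the content page and everything before it is a folder, so it is only an approximation of click depth.

```python
from urllib.parse import urlsplit

def folder_count(url):
    """Estimate depth as the number of folders that come before the content page."""
    # Add a scheme so urlsplit separates the host from the path.
    if "://" not in url:
        url = "http://" + url
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Assumption: the final segment is the page itself, the rest are folders.
    return max(len(segments) - 1, 0)

print(folder_count("mysite.com/category1/subcategory1/subsubcategory1/content"))
```

Note that a URL ending in a folder with nothing after it is treated here as a page, which may or may not match how your site is laid out.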
Sometimes content we put on our website is not accessible to crawlers. This could be because it is not linked to properly in the rest of the site, or the content is too deep in the site and the crawlers don’t have the time to get to it. Here all we need to do is look at the number of rows we have in our spreadsheet, and then compare that to the pages we know we have on our site.
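Counting rows tells you that something is missing, but not what. If you can get a list of the pages you know exist (from your CMS or a sitemap, for instance), a rough sketch like the one below would name the missing ones. The helper names and sample URLs are assumptions for illustration only.

```python
def normalize(url):
    """Trim whitespace and a trailing slash so equivalent URLs compare equal."""
    return url.strip().rstrip("/")

def missing_from_crawl(known_urls, crawled_urls):
    """Return known pages that never showed up in the crawl."""
    crawled = {normalize(u) for u in crawled_urls}
    known = {normalize(u) for u in known_urls}
    return sorted(known - crawled)

# Hypothetical lists: pages you know you have vs. pages the crawler found.
known = [
    "mysite.com/category1/content",
    "mysite.com/category2/content/",
    "mysite.com/category3/orphan",
]
crawled = [
    "mysite.com/category1/content",
    "mysite.com/category2/content",
]
print(missing_from_crawl(known, crawled))
```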
Crawl Paths
A crawl path is the way that a spider got to a specific content page; it is best represented by the URL. So if Googlebot comes in and finds your content page three levels down in subdirectory B, that is the content’s crawl path. A lot of content, whether it be products or blog posts, will fit into more than one category. Even if that is the case, each piece of content should only have one distinct crawl path. You should be able to get to the content through internal links, but each piece of content should have one distinct URL. Now we are going to take the content page column and drop it in both the ‘Row Label’ and ‘Values’ sections of the pivot table. The value should default to count; if it doesn’t, be sure to change it. This will give a count of how many times the spider found that specific piece of content through different URLs.
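The same count can be sketched in a few lines of Python. This assumes, as the pivot table does, that the last segment of the URL identifies the content page; the function names and example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def content_page(url):
    """Return the last path segment, which we assume names the content page."""
    if "://" not in url:
        url = "http://" + url
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[-1] if segments else ""

def duplicate_paths(urls):
    """Count how often each content page appears and keep those seen more than once."""
    counts = Counter(content_page(u) for u in urls)
    return {page: n for page, n in counts.items() if n > 1}

urls = [
    "mysite.com/shoes/red-sneaker",
    "mysite.com/sale/red-sneaker",
    "mysite.com/shoes/blue-boot",
]
print(duplicate_paths(urls))
```

Two genuinely different pages that happen to share a final segment would also be flagged, so treat the result as a list of pages to check by hand.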
Site Organization
Your website’s organization should be intuitive and easy to use. To really make sure you’ve got this right, you will need to get flash cards or pieces of paper. Write all of your major categories and subcategories on separate note cards. Then go find a potential customer, and ask them to organize the cards how they think they should be organized. Obviously, doing this with a large sample is better than with one person. However, you should be able to get the idea. Once you’ve seen how your customer organizes your site, go to your pivot table and drop all of the pieces of your URL in the ‘Row Label’ section, starting from the largest directory and working down to the content page. Now we want to check to see if those two visualizations match up.
Common Problems:
Inbound Links Pointing to the Wrong Place
One thing you may encounter as you look at where your inbound links are pointing is that they may be pointing at the wrong page. Each link is a vote for importance. If you have one page that is absolutely not valuable, but has more votes than the page that is valuable, search engines will see the first page as more valuable. This can lead to some very poor traffic and rankings.
Too Many Outbound Links
There used to be a fuzzy rule that said about 100 links was the most Google would crawl on any given page. Since then the rule has gotten fuzzier, and there is some evidence to suggest the number is now around 150. Check to see which of your pages have the most outbound links, and try to keep the number reasonable and well under 150.
Parts of a Page Aren’t Being Crawled
If you know that a section of a page, such as a widget or sidebar, has a lot of links in it, and they are not showing up in the link count, then there is probably something keeping a web spider from crawling that specific section of the page. There could be several things going on.
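A quick way to find the worst offenders is to filter the crawl rows against that rough limit. The sketch below uses made-up rows and field names, and the 150 figure is only the fuzzy number mentioned above, not a documented cutoff.

```python
LINK_LIMIT = 150  # the rough figure discussed above, not an official limit

# Hypothetical crawl rows with a total outbound link count per page.
pages = [
    {"url": "/", "outlinks": 95},
    {"url": "/category1", "outlinks": 210},
    {"url": "/category1/content", "outlinks": 150},
]

def over_limit(pages, limit=LINK_LIMIT):
    """Return pages at or above the limit, most links first."""
    flagged = [p for p in pages if p["outlinks"] >= limit]
    return sorted(flagged, key=lambda p: p["outlinks"], reverse=True)

for page in over_limit(pages):
    print(page["url"], page["outlinks"])
```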
Duplicate Crawl Paths
A duplicate crawl path is when a piece of content fits into more than one category, and your CMS creates two distinct URLs to get to the same page. This costs you ranking potential, and can create duplicate content issues on your site. Each URL is like a home address. If you live in an area that has recently been annexed by a large city, you may face a problem similar to your website’s, because some of your mail will say 1234 Lane, Big City and other mail may say 1234 Lane, Small Suburb. Anyone who’s been to your house knows it’s the same house, but to a computer those would be two different locations. If a content page has a count greater than 1, then you have duplicate URL paths.
What a User sees:

What Google sees:

Pages Not Present in the Crawl
If you have pages that are missing from your crawl, there could be a variety of reasons. You could be accidentally blocking them with your robots.txt, or you could simply not be linking to them. Depending on the CMS, this could change. You will need to navigate to the page directly to see what is going on.
Your Site Is Too Deep
If it takes you more than three clicks to get to the deepest page in your site, then your site is too deep. Be sure to check for the biggest value in the ‘Depth’ column. Anything greater than 2 (remember, the home page is marked as 0) is most likely too deep. Some pages are OK if they are buried deep in your site, such as admin logins, the privacy policy, etc. Those pages are important, but they will rarely be making you any money or bringing in tons of traffic from the search engines.
Your Organization Does Not Match Up With Users
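To rule out the robots.txt explanation, you can test the missing URLs against your rules with Python's standard urllib.robotparser module. The robots.txt content and URLs below are invented for the example; in practice you would paste in your own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; replace with the text of your own file.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def blocked_urls(urls, agent="*"):
    """Return the URLs that the given user agent is not allowed to fetch."""
    return [u for u in urls if not parser.can_fetch(agent, u)]

missing = [
    "http://mysite.com/private/page",
    "http://mysite.com/category1/content",
]
print(blocked_urls(missing))
```

Any missing page that is not blocked here is more likely an internal linking problem than a robots.txt problem.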
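The depth check can be sketched as a simple filter that skips the pages you are happy to leave buried. The rows, the exempt URLs, and the function name too_deep are all assumptions for illustration; the threshold of 2 follows the homepage-as-0 convention described above.

```python
MAX_DEPTH = 2  # homepage counts as depth 0

# Hypothetical utility pages that are fine to leave deep in the site.
EXEMPT = {"/privacy-policy", "/admin/login"}

# Hypothetical crawl rows with the crawler's reported depth.
pages = [
    {"url": "/category1/content", "depth": 2},
    {"url": "/category1/sub/sub/content", "depth": 4},
    {"url": "/privacy-policy", "depth": 3},
]

def too_deep(pages, max_depth=MAX_DEPTH, exempt=EXEMPT):
    """Return URLs deeper than max_depth, ignoring exempt utility pages."""
    return [
        p["url"]
        for p in pages
        if p["depth"] > max_depth and p["url"] not in exempt
    ]

print(too_deep(pages))
```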
If your organization is vastly different from the results you got from your flash card test, then you probably need to rethink how you have your site organized. This may include re-categorizing content pages, or re-organizing your hierarchy of categories.
How to fix common problems:
Remove/Rewrite URL Paths
If you find that your site has duplicate crawl paths to a URL, then you may need to do a few different things depending on your CMS. You may need to rewrite your current URL structure to give each page a unique address. While it can be a pain in the butt, you can either hire someone on oDesk to do it, or follow this guide to URL rewrites for beginners by addedbytes.com. If none of that sounds overly appealing, then I would suggest setting up a rel=canonical for each content page pointing to the most relevant URL. In the past, this is how sites like Zappos have dealt with this issue.
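For the rel=canonical route, each duplicate URL's page carries a link tag in its head that names the preferred address. A minimal sketch of generating that tag is below; the helper name and the example URL are hypothetical, and how you get the tag into your templates depends entirely on your CMS.

```python
from html import escape

def canonical_tag(preferred_url):
    """Build the <link rel="canonical"> tag pointing at the preferred URL."""
    # Escape the URL so characters like & are safe inside the attribute.
    return '<link rel="canonical" href="{}">'.format(escape(preferred_url, quote=True))

# Hypothetical preferred address for a product reachable under two categories.
print(canonical_tag("http://mysite.com/shoes/red-sneaker"))
```

The same tag would go on every duplicate version of the page, all pointing at the one URL you picked.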
Bring Pages up in Your Site
This may also involve changing your URL structure. However, it usually just means changing the settings in your CMS, or combining some of your smaller categories. The ideal structure is flat, and about two levels deep. Your content will still be accessible via categories, but will be easier for users and search engines to reach.
Adjust Internal Linking
This can solve many of the problems listed above. To get the best bang for your buck, there are a few areas where you should focus on changing your links. The first is your navigation. Your navigation is usually on every page (if it’s not, you have other problems), so this is a great opportunity to include valuable pages in your top-level navigation. The next area is widgets or sidebars. If you have a piece of content or a product that is popular and doesn’t change all that much, consider putting a link to it in the sidebar area of your different pages. This way you are pointing every page to that page, giving it more links (votes for importance). The final area I would check immediately is the footer. Be sure to include easy-to-find information in the footer. So instead of having a link to the ‘contact us’ page, think about putting the contact info directly in the footer. That makes it easier for users to find, and eliminates all those links pointing to a page that, for all practical purposes, gives minimal benefit. Be sure to check out this video on SEOMoz.
Re-organize Content
If what you thought your organization should be is way different from what your customers think it should be, then you will probably need to re-categorize your content. It’s not fun, and it may be time consuming. However, when your site more accurately reflects the way a customer would use it, you will see improvements such as more time on site, lower bounce rates, and most likely higher search engine rankings.




