Keylime Toolbox Crawl Analytics uses your server logs (web access logs or equivalent) to determine what URLs search engines are crawling and what technical issues the search engine bots are encountering that prevent complete indexing of the site.

Integrating your server logs with Keylime Toolbox involves stripping all non-bot entries from the logs (so that they contain no personally identifiable information) and then uploading the logs to an AWS S3 bucket. Ideally, logs are uploaded each day (or more often) for each server.

Frequently Asked Questions
Getting Started
Automatic Uploads

Frequently Asked Questions

Why Does Keylime Toolbox Need Server Logs Uploaded?

Web access logs are the only place to see exactly what URLs the search engine bots are crawling. This information is critical for SEO efforts and answers questions such as:

  • Is the site being fully crawled? If it’s not, the site may not be comprehensively indexed.
  • Which URLs are being crawled and which aren’t? Crawl patterns shed light on what pages search engines find important.
  • How often are URLs crawled? This information can help determine how long it will take for changes to be reflected in search engine indices.
  • Are search engines crawling the same page with different URLs? The site may have canonicalization issues causing crawl efficiency and PageRank dilution issues.
  • Are pages being crawled that should be blocked from indexing?
  • Are search engines getting server errors when crawling pages?

We Have Security Policies In Place That Prevent Us From Giving Out Personally Identifiable Information

Keylime Toolbox doesn’t use personally identifiable information (PII). As you’ll see below, it’s best if you provide us with a filtered file that contains only search engine bot entries.

Who Has Access to This Data?

We store raw web access logs on Amazon S3. This data is available only to you, to Keylime Toolbox staff who need access to monitor and support your log uploads, and to the Keylime Toolbox application itself.

We Still Have Security Concerns

Amazon provides more details about the security of AWS, but please contact us at support@keylimetoolbox.com with any questions and we’ll be happy to provide further details.

How Long Do You Store Our Raw Log Data?

We store raw files for up to two weeks; processed output is retained in Keylime Toolbox so you can view historical trends. (See more details on the reports available.)

Getting Started

Getting set up for log file uploads involves the following steps, detailed below.

  1. Send us your AWS account identifier
  2. We’ll provide you the name of the bucket we created for you, authorized for your access
  3. Make sure your log files have the correct data fields in them
  4. Upload a sample log file to your bucket, setting permissions, so we can review
  5. Filter your log files for bot traffic and compress them
  6. Upload them daily (or more often) to the bucket

You will only need to do the first 4 steps once.

Getting Access to Your Keylime Toolbox AWS S3 Bucket

Keylime Toolbox sets up a private Amazon Web Services (AWS) S3 bucket for your account. As part of the initial configuration, you’ll need an AWS account. (You can create a free one if you don’t already have one.) You will send us the account identifier so that we can give you permission to upload to your bucket. (Each customer has a separate bucket under our AWS account; no other customer can access your bucket, and we cover all charges related to it.)

If you have not set up AWS previously, the simplest option is to send us the “Canonical User ID” of the AWS Account:

  1. Create a free Amazon Web Services account.
  2. Find your AWS account canonical user ID:
    1. Go to your AWS security credentials page.
    2. Expand the Account Identifiers section to access the canonical user ID.
  3. Email your Canonical User ID to logs@keylimetoolbox.com.

If you are already using AWS and have implemented IAM Users, we can configure permissions for a specific User or Role.

  1. Create the User or Role in AWS IAM.
  2. Ensure that the User or Role has permissions to upload files to the S3 bucket that we provide for you. The specific permissions you need may depend on your tools, but you will need at least s3:PutObject and s3:PutObjectAcl.
  3. Email the ARN for the User or Role to logs@keylimetoolbox.com.

Once we receive your Canonical User ID or IAM ARN, we’ll provide access to your Keylime Toolbox AWS S3 bucket and send you an email. You can then begin uploading files.

The Log File

Keylime Toolbox supports web access logs in any format. These may come from your web server, load balancer, or something in between. The specific fields are listed under Log File Contents, below; however, the names we need may vary based on the format, so we can work with you on the details. Once we get the file, we’ll be in touch about this if needed.

Source Data Needed

  • Web access logs
  • Filtered to only bot traffic
  • Logs from all servers (if the site operates from multiple servers)

Log File Format Details

  • Log files can be in any format
  • If not all log files are in the same format, email logs@keylimetoolbox.com after the first upload to let us know
  • Compress each file using gzip compression. Each file should be compressed separately
  • Ensure files aren’t password protected
  • Each file should have no more than 24 hours’ data (it can have less)
  • Include the date (in any format) in the file name representing the date of data in the file
  • Name files in such a way that they don’t overwrite each other when uploading new files
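
The file rules above (at most 24 hours per file, the date in the file name, one gzip archive per file) can be sketched in shell as follows. This is only an illustration under stated assumptions: it expects Apache combined-format lines whose fourth whitespace-separated field is the timestamp, and the function name split_and_compress and the bots- file prefix are hypothetical, not names Keylime Toolbox requires.

```shell
#!/bin/sh
# Sketch: split a log into one file per calendar day, then gzip each file
# separately. Assumes combined-format lines such as
#   66.249.66.1 - - [08/Feb/2017:23:59:58 +0000] "GET / HTTP/1.1" 200 ...
# The "bots-" prefix and the function name are illustrative only.
split_and_compress() {
    input=$1     # log file to split
    outdir=$2    # directory that receives the per-day files
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        {
            # Field 4 looks like "[08/Feb/2017:23:59:58"; keep "08/Feb/2017".
            split(substr($4, 2), parts, ":")
            split(parts[1], dmy, "/")
            out = dir "/bots-" dmy[3] "-" dmy[2] "-" dmy[1] ".log"
            # Append, so lines for the same day end up in the same file.
            print >> out
            close(out)
        }
    ' "$input"
    # Compress every per-day file on its own (gzip replaces each .log with .log.gz).
    gzip -f "$outdir"/bots-*.log
}
```

Calling split_and_compress filtered_bots.log out would leave files such as out/bots-2017-Feb-08.log.gz. Because the awk step appends, start with an empty output directory on each run. If your format puts the virtual host first (%v), the timestamp moves to a later field and the script needs adjusting.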

File Size

Keylime Toolbox can process files of any size, but for ease of uploading, exclude non-bot traffic as described below.

Duration and Upload Frequency

Crawl Analytics works best with comprehensive server logs. Keylime Toolbox processes data nightly and displays the output in daily increments. For best results, set your script to upload logs daily so that you can view the most up-to-date analysis in Keylime Toolbox.

If you are unable to upload comprehensive logs, Keylime Toolbox can work with any duration you’re able to provide.

Bot Filtering

Keylime Toolbox extracts search engine bot traffic only, so you can filter out all entries other than these user agents:

  • googlebot
  • bingbot
  • msnbot
  • slurp
  • baidu
  • yandex
  • naver

The following case-insensitive regular expression pattern will ensure the filtered data includes all variations of these user agents (for instance, Googlebot and Googlebot-Image):

googlebot|bingbot|msnbot|slurp|baidu|yandex|naver|^#Fields:

The regex pattern above needs to be used as a case-insensitive match, because the user agents will appear in the log files like this: Googlebot/2.1, msnbot, and Yahoo! Slurp. For example, you could use extended grep like this:

grep -iE 'googlebot|bingbot|msnbot|slurp|baidu|yandex|naver|^#Fields:' access.log > filtered_bots.log
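
After filtering, a quick count per bot can confirm that the case-insensitive match kept the entries you expect. The sketch below is an optional sanity check, not part of the Keylime Toolbox process; the function name count_bot_hits is hypothetical, and the bot names come from the list above.

```shell
#!/bin/sh
# Sketch: count how many lines of a filtered log mention each bot name,
# ignoring case. A line is counted once per bot, however many times the
# name appears on it.
count_bot_hits() {
    for bot in googlebot bingbot msnbot slurp baidu yandex naver; do
        # grep -c prints the number of matching lines (0 if there are none).
        printf '%s\t%s\n' "$bot" "$(grep -ic "$bot" "$1")"
    done
}
```

Running count_bot_hits filtered_bots.log prints one tab-separated line per bot. A count of zero for a bot you know crawls the site may mean the filter ran as a case-sensitive match or that logs from some servers are missing.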

Log File Contents

You can upload the log files in their entirety, and Keylime Toolbox will extract what’s needed. However, if for space or privacy reasons, you’d rather filter the files before sending them, ensure the uploaded files contain:

  • Date of log entry
  • Time of log entry
  • URL requested
  • User agent of the requestor
  • HTTP response code
  • Hostname (ideally; required if the individual log files are for multiple hosts or subdomains)
  • IP address of requestor (ideally; but without this we can still provide most reports)
  • Protocol: http or https (without this we can still provide reports but they won’t provide as much insight)
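
To spot-check that a file carries the fields listed above, you can print them from a few lines. The sketch below assumes Apache combined-format lines and no escaped double quotes inside the request, referrer, or user agent. The function name show_required_fields is hypothetical.

```shell
#!/bin/sh
# Sketch: print client IP, date and time, request line, status code, and
# user agent from combined-format log lines, separated by tabs.
# Reads the files given as arguments, or standard input if none are given.
show_required_fields() {
    awk -F'"' '
        {
            # Splitting on double quotes gives:
            #   $1 = ip ident user [date:time zone]
            #   $2 = request line
            #   $3 = status and bytes
            #   $6 = user agent
            split($1, head, " ")
            split($3, tail, " ")
            timestamp = substr(head[4], 2)   # drop the leading "["
            printf "%s\t%s\t%s\t%s\t%s\n", head[1], timestamp, $2, tail[1], $6
        }
    ' "$@"
}
```

For example, head -n 5 filtered_bots.log | show_required_fields prints the five values for the first five entries. An empty column suggests the corresponding field is missing from the log format.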

Server-Specific Details

The field names for the data Keylime Toolbox uses vary by log file type. Below are some common log types. If you have a different log file format, email a 10-line sample log file to logs@keylimetoolbox.com and we can help you configure your server for the right format.

Apache Combined Log Format

The Apache combined log format contains all the fields Keylime Toolbox needs to provide a comprehensive view of search engine crawl behavior.

The default Apache configuration sets up the combined format for logs, which includes all the needed information. You should see something like this for each of your virtual sites in your Apache configuration:

CustomLog /var/log/apache2/access.log combined

If you have a different log format and don’t need it to be custom, changing it to combined format will ensure Keylime Toolbox can process it.

Apache Custom Format

If you are using a non-standard Apache log format, below are the Apache log formatting macros for the fields that Keylime Toolbox uses.

%h                    IP address or PTR name of the client making the request
%t                    Date and time of log entry
%r                    Request line, including method, URI, and protocol (this is the same format as “%m %U%q %H”)
%>s                   HTTP response code (such as 200 or 404)
\"%{User-agent}i\"    User agent making the request (such as Googlebot)
%v                    Required if you are serving multiple virtual hosts; indicates the virtual host that the request is intended for

If you are using a custom Apache log file format, you should see two lines like this in your Apache configuration (they may be in different places in the file or in different configuration files):

LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{User-agent}i\"" vhost_custom
CustomLog /var/log/apache2/access.log vhost_custom

Apache Common Log Format

The Apache “Common” log format doesn’t contain user agent information, so it is not usable by the Keylime Toolbox. If the server is set up with this type of logging, we recommend you switch to Apache “Combined” log format (see above). If this isn’t possible, email logs@keylimetoolbox.com and we can work with you to determine a solution.

IIS Log Format

Below are the needed entries for IIS logs.

date              Date of log entry
time              Time of log entry
cs-uri-stem       URL without query parameters
cs-uri-query      URL query parameters
c-ip              IP address of the client making the request
cs(User-Agent)    User agent making the request (such as Googlebot)
sc-status         HTTP response code (such as 200 or 404)
cs-host           Required if you are serving multiple virtual hosts; indicates the hostname that the request is intended for
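
IIS logs in W3C format normally begin with a #Fields: directive naming the columns, which is why the bot filter pattern keeps lines starting with #Fields:. The sketch below reads that directive and reports any of the entries above that are absent. It is an optional check under the assumption that the log has such a header. The function name check_iis_fields is hypothetical, and cs-host is left out of the required list because it is only needed for multiple hosts.

```shell
#!/bin/sh
# Sketch: verify that the first "#Fields:" directive in an IIS log names
# every field Keylime Toolbox needs. Prints the missing names and returns 1
# if any are absent.
check_iis_fields() {
    required='date time cs-uri-stem cs-uri-query c-ip cs(User-Agent) sc-status'
    # Take the first "#Fields:" line, drop the directive name and any CR.
    present=$(sed -n '/^#Fields:/{s/^#Fields: *//p;q;}' "$1" | tr -d '\r')
    missing=''
    for field in $required; do
        case " $present " in
            *" $field "*) ;;                    # field is listed
            *) missing="$missing $field" ;;     # field is absent
        esac
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all required fields present"
}
```

Run it as check_iis_fields filtered_bots.log. If fields are reported missing, they can usually be enabled in the IIS logging settings for the site.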

Akamai

If your site uses Akamai, contact them and ask them for logs containing the required fields above.

Fastly

If your site uses Fastly, you can configure an S3 Log Streaming Endpoint to send logs to your bucket. Please contact us if you wish to do this as we’ll have to create credentials for you to ensure the logs have proper permission when uploaded.

nginx Log Format

The default log format used by nginx includes the needed entries.
If you have defined your own nginx log file format you would have lines similar to this in your nginx configuration file:

log_format gzip '$remote_addr - $remote_user [$time_local]  "$request" $status $bytes_sent "$http_referer" "$http_user_agent"';
access_log  /spool/logs/nginx-access.log  gzip  buffer=32k;

The following fields are required for nginx:

$remote_addr        IP address or PTR name of the client making the request
$time_local         Date and time of log entry
$request            Request line, including method, URI, and protocol
$status             HTTP response code (such as 200 or 404)
$http_user_agent    User agent making the request (such as Googlebot)
$http_host          Required if you are serving multiple virtual hosts; indicates the virtual host that the request is intended for

Uploading Log Files to Keylime Toolbox

When we’ve notified you that you have access to your AWS S3 bucket, upload a test file (as described in this section) then send an email to logs@keylimetoolbox.com to let us know that it’s there. We’ll make sure Keylime Toolbox can process the file correctly and make any needed configuration adjustments. We’ll let you know when we’re done. After that, you can set up a script for automatic uploads.

Detailed instructions using several S3 tools follow. The high level steps are:

  1. Upload the log files to the Keylime Toolbox AWS S3 bucket.
  2. Give full control of the uploaded files to “bucket owner” or the Keylime Toolbox AWS ID:
    0b2b6d7e33c143cff9616ebaaee4c4670db68853a308b382410a1c0bf2ba2ace.

Because you uploaded the files from your account (even though you uploaded them into the Keylime Toolbox account), Keylime Toolbox can’t access the files unless you enable full control.

Tools For Uploading Logs

Some available tools for uploading files to S3 include:

  • aws (the AWS Command Line Interface)
  • s3cmd
  • S3 Express

In the examples below, yourbucketid-keylime stands for the name of your Keylime Toolbox AWS S3 bucket, which we email to you. You can upload multiple files, but each should be gzipped individually.

aws Example

$ aws s3 cp logfile-2017-02-08.gz s3://yourbucketid-keylime/ --acl bucket-owner-full-control
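
Once your test file has been reviewed, the filter, compress, and upload steps can be combined into a script run daily (for example from cron). The sketch below is one possible arrangement, not a script provided by Keylime Toolbox. The function name upload_bot_log, the AWS_CMD variable, and the output file naming are hypothetical, and the machine name from uname -n is included only so that files from different servers don't overwrite each other.

```shell
#!/bin/sh
# Sketch of a daily upload: filter for bots, gzip, and copy to the bucket.
# Set AWS_CMD="echo aws" to print the upload command instead of running it.
AWS_CMD=${AWS_CMD:-aws}
BOT_PATTERN='googlebot|bingbot|msnbot|slurp|baidu|yandex|naver|^#Fields:'

upload_bot_log() {
    log=$1       # raw access log covering one day
    day=$2       # date of the data in the file, e.g. 2017-02-08
    bucket=$3    # bucket name emailed to you, e.g. yourbucketid-keylime
    out="logfile-$(uname -n)-$day.gz"

    # Keep only search engine bot entries, then compress the result.
    grep -iE "$BOT_PATTERN" "$log" | gzip -c > "$out"

    # Upload and give the bucket owner full control in the same step.
    $AWS_CMD s3 cp "$out" "s3://$bucket/" --acl bucket-owner-full-control
}
```

A cron job could then call upload_bot_log /var/log/apache2/access.log.1 2017-02-08 yourbucketid-keylime with the path and date of the rotated log; the path shown here is only an example and depends on your log rotation setup.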

s3cmd Examples

1. Upload Files to the Keylime Toolbox AWS S3 Bucket

$ s3cmd put logfile-2017-02-08.gz s3://yourbucketid-keylime/

2. Grant Keylime Toolbox FULL_CONTROL of the Uploaded File(s)

Keylime Toolbox can only process the uploaded files if you grant us access. To do this, give FULL_CONTROL to the Keylime Toolbox canonical user ID: 0b2b6d7e33c143cff9616ebaaee4c4670db68853a308b382410a1c0bf2ba2ace.

$ s3cmd --acl-grant=full_control:0b2b6d7e33c143cff9616ebaaee4c4670db68853a308b382410a1c0bf2ba2ace setacl s3://yourbucketid-keylime/logfile-2017-02-08.gz
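
When several files are uploaded with s3cmd, both steps have to be repeated for each file. A loop along the following lines is one way to do that. It is a sketch that reuses only the two s3cmd commands shown above; the function name upload_and_grant and the S3CMD variable are hypothetical.

```shell
#!/bin/sh
# Sketch: upload each file with s3cmd, then grant FULL_CONTROL on it to the
# Keylime Toolbox canonical user ID.
# Set S3CMD="echo s3cmd" to print the commands instead of running them.
S3CMD=${S3CMD:-s3cmd}
KEYLIME_ID=0b2b6d7e33c143cff9616ebaaee4c4670db68853a308b382410a1c0bf2ba2ace

upload_and_grant() {
    bucket=$1    # bucket name emailed to you, e.g. yourbucketid-keylime
    shift
    for file in "$@"; do
        name=$(basename "$file")
        # Step 1: upload the file to the bucket.
        $S3CMD put "$file" "s3://$bucket/" || return 1
        # Step 2: grant Keylime Toolbox full control of the uploaded object.
        $S3CMD --acl-grant="full_control:$KEYLIME_ID" setacl "s3://$bucket/$name" || return 1
    done
}
```

For example, upload_and_grant yourbucketid-keylime logfile-2017-02-08.gz logfile-2017-02-09.gz uploads both files and stops at the first command that fails.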

s3 Express Example

/> put filename-2017-02-08.gz yourbucketid-keylime -cacl:bucket-owner-full-control