Keylime Toolbox Crawl Analytics uses your server logs (web access logs or equivalent) to determine what URLs search engines are crawling and what technical issues the search engine bots are encountering that prevent complete indexing of the site.
Integrating your server logs with Keylime Toolbox involves stripping all non-bot entries from the logs (so that they contain no personally identifiable information) and then uploading the logs to an AWS S3 bucket. Ideally, logs are uploaded each day (or more often) for each server.
- Frequently Asked Questions
- Getting Started
- Automatic Uploads
Frequently Asked Questions
Why Does Keylime Toolbox Need Server Logs Uploaded?
Web access logs are the only place to see exactly what URLs the search engine bots are crawling. This information is critical for SEO efforts and answers questions such as:
- Is the site being fully crawled? If it’s not, the site may not be comprehensively indexed.
- Which URLs are being crawled and which aren’t? Crawl patterns shed light on what pages search engines find important.
- How often are URLs crawled? This information can help determine how long it will take for changes to be reflected in search engine indices.
- Are search engines crawling the same page with different URLs? The site may have canonicalization issues causing crawl efficiency and PageRank dilution issues.
- Are pages being crawled that should be blocked from indexing?
- Are search engines getting server errors when crawling pages?
We Have Security Policies In Place That Prevent Us From Giving Out Personally Identifiable Information
Keylime Toolbox doesn’t use personally identifiable information (PII). As you’ll see below, it’s best if you provide us with a filtered file that contains only search engine bot entries.
Who Has Access to This Data?
We store raw web access logs on Amazon’s S3. This data is available only to you and to the Keylime Toolbox staff who need access to monitor and support your log uploads.
We Still Have Security Concerns
Amazon provides more details about the security of AWS, but please contact us at support@keylimetoolbox.com with any questions and we’ll be happy to provide further details.
How Long Do You Store Our Raw Log Data?
We store raw files for up to two weeks; processed output is retained in Keylime Toolbox so you can view historical trends. (See more details on the reports available.)
Getting Started
Getting set up for log file uploads involves the following steps, detailed below.
- Send us your AWS account identifier
- We’ll provide you the name of the bucket we created for you, authorized for your access
- Make sure your log files have the correct data fields in them
- Upload a sample log file to your bucket, setting permissions, so we can review
- Filter your log files for bot traffic and compress them
- Upload them daily (or more often) to the bucket
You will only need to do the first 4 steps once.
Getting Access to Your Keylime Toolbox AWS S3 Bucket
Keylime Toolbox sets up a private Amazon Web Services (AWS) S3 bucket for your account. As part of the initial configuration, you’ll need an AWS account. (You can create a free one if you don’t already have one). You will send us the account identifier so that we can give you permission to upload to your bucket. (Each customer has a separate bucket under our AWS account; no one can access your bucket other than you and we cover all charges related to the bucket.)
If you have not set up AWS previously, the simplest option is to send us the “Canonical User ID” of the AWS Account:
- Create a free Amazon Web Services account.
- Find your AWS account canonical user ID:
- Go to your AWS security credentials page.
- Expand the Account Identifiers section to access the canonical user ID.
- Email your Canonical User ID to logs@keylimetoolbox.com.
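If you already have the AWS command line tool installed and configured, you can also look up the canonical user ID from the command line; this is simply an alternative to the console steps above and assumes your CLI credentials belong to the account you want to use:
$ aws s3api list-buckets --query "Owner.ID" --output text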
If you are already using AWS and have implemented IAM Users, we can configure permissions for a specific User or Role.
- Create the User or Role in AWS IAM.
- Ensure that the User or Role has permissions to upload files to the S3 bucket that we provide for you. The specific permissions you need may depend on your tools, but you will at least need PutObject and PutObjectAcl (a minimal policy example follows this list).
- Email the ARN for the User or Role to logs@keylimetoolbox.com.
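For reference, here is a minimal sketch of granting those two permissions from the aws command line tool. The user name keylime-log-uploader is a hypothetical placeholder, and yourbucketid-keylime stands for the bucket name we send you; if you are using a Role, attach an equivalent policy to the Role instead:
$ aws iam put-user-policy \
    --user-name keylime-log-uploader \
    --policy-name keylime-s3-upload \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:PutObjectAcl"],
        "Resource": "arn:aws:s3:::yourbucketid-keylime/*"
      }]
    }'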
Once we receive your Canonical User ID or IAM ARN, we’ll provide access to your Keylime Toolbox AWS S3 bucket and will send you an email. You can now begin uploading files.
The Log File
Keylime Toolbox supports web access logs in any format. These may come from your web server, load balancer, or something in between. The specific fields are listed under Log File Contents, below; however, the names we need may vary based on the format, so we can work with you on the details. Once we get the file, we’ll be in touch about this if needed.
Source Data Needed
- Web access logs
- Filtered to only bot traffic
- Logs from all servers (if the site operates from multiple servers)
Log File Format Details
- Log files can be in any format
- If not all log files are in the same format, email logs@keylimetoolbox.com after the first upload to let us know
- Compress each file using gzip compression. Each file should be compressed separately
- Ensure files aren’t password protected
- Each file should have no more than 24 hours’ data (it can have less)
- Include the date (in any format) in the file name representing the date of data in the file
- Name files in such a way that they don’t overwrite each other when uploading new files (see the example after this list)
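As an example of the compression and naming guidelines above, the following sketch gzips one day of filtered log data into a dated, per-server file name. The file paths, the server label web01, and the GNU date syntax are assumptions; adjust them for your environment (on BSD/macOS, use date -v-1d +%Y-%m-%d):
$ DATE=$(date -d yesterday +%Y-%m-%d)
$ gzip -c filtered_bots.log > "bots-web01-${DATE}.log.gz"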
File Size
Keylime Toolbox can process files of any size, but for ease of uploading, exclude non-bot traffic as described below.
Duration and Upload Frequency
Crawl Analytics work best with comprehensive server logs. Keylime Toolbox processes data nightly and displays the output in daily increments. For best results, set your script to upload logs daily so that you can view the most up-to-date analysis in Keylime Toolbox.
If you are unable to upload comprehensive logs, Keylime Toolbox can work with any duration you’re able to provide.
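For example, a nightly cron entry like the one below could run a filter-and-upload script shortly after your logs rotate. The script path and schedule are placeholders for whatever steps you assemble from the sections that follow:
# m h dom mon dow command
15 2 * * * /usr/local/bin/keylime_upload.sh >> /var/log/keylime_upload.log 2>&1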
Bot Filtering
Keylime Toolbox extracts search engine bot traffic only, so you can filter out all entries other than these user agents:
- googlebot
- bingbot
- msnbot
- slurp
- baidu
- yandex
- naver
The following case-insensitive, regular expression pattern will ensure the filtered data includes all variations of these user agents (for instance, Googlebot and Googlebot-Images):
googlebot|bingbot|msnbot|slurp|baidu|yandex|naver|^#Fields:
The regex pattern above needs to be used as a case-insensitive match because the user agents appear in the log files in forms such as Googlebot/2.1, msnbot, and Yahoo! Slurp. For example, you could use extended, case-insensitive grep like this:
grep -iE 'googlebot|bingbot|msnbot|slurp|baidu|yandex|naver|^#Fields:' access.log > filtered_bots.log
Log File Contents
You can upload the log files in their entirety, and Keylime Toolbox will extract what’s needed. However, if you’d rather filter the files before sending them, for space or privacy reasons, ensure the uploaded files contain:
- Date of log entry
- Time of log entry
- URL requested
- User agent of the requestor
- HTTP response code
- Hostname (ideally; required if the individual log files are for multiple hosts or subdomains)
- IP address of requestor (ideally; but without this we can still provide most reports)
- Protocol: http or https (without this we can still provide reports but they won’t provide as much insight)
Server-Specific Details
The field names for the data Keylime Toolbox uses vary by log file type. Below are some common log types. If you have a different log file format, email a 10-line sample log file to logs@keylimetoolbox.com and we can help you configure your server for the right format.
Apache Combined Log Format
The Apache combined log format contains all the fields Keylime Toolbox needs to provide a comprehensive view of search engine crawl behavior.
The default Apache configuration sets up the combined format for logs, which includes all of the needed information. You should see something like this for each of your virtual sites in your Apache configuration:
CustomLog /var/log/apache2/access.log combined
If you have a different log format and don’t need it to be custom, changing it to combined format will ensure Keylime Toolbox can process it.
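For reference, the stock Apache configuration typically defines the combined format along these lines (some distributions use %O rather than %b for the bytes-sent field):
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined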
Apache Custom Format
If you are using a non-standard Apache log format, below are the Apache log formatting macros for the fields that Keylime Toolbox uses.
%h | IP address or PTR name of the client making the request |
%t | Date and time of log entry |
%r | Request line, including method, URI, and protocol (this is the same format as “%m %U%q %H”) |
%>s | HTTP response code (such as 200 or 404) |
\"%{User-agent}i\" | User agent making the request (such as Googlebot) |
%v | Required if you are serving multiple virtual hosts; indicates the virtual host that the request is intended for |
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{User-agent}i\"" vhost_custom
CustomLog /var/log/apache2/access.log vhost_custom
Apache Common Log Format
The Apache “Common” log format doesn’t contain user agent information, so it is not usable by Keylime Toolbox. If the server is set up with this type of logging, we recommend you switch to the Apache “Combined” log format (see above). If this isn’t possible, email logs@keylimetoolbox.com and we can work with you to determine a solution.
IIS Log Format
Below are the needed entries for IIS logs.
date | Date of log entry |
time | Time of log entry |
cs-uri-stem | URL without query parameters |
cs-uri-query | URL query parameters |
c-ip | IP address of the client making the request |
cs(User-Agent) | User agent making the request (such as Googlebot) |
sc-status | HTTP response code (such as 200 or 404) |
cs-host | Required if you are serving multiple virtual hosts; indicates the hostname that the request is intended for |
Akamai
If your site uses Akamai, contact them and ask them for logs containing the required fields above.
Fastly
If your site uses Fastly, you can configure an S3 Log Streaming Endpoint to send logs to your bucket. Please contact us if you wish to do this as we’ll have to create credentials for you to ensure the logs have proper permission when uploaded.
nginx Log Format
The default log format used by nginx includes the needed entries.
If you have defined your own nginx log file format, you will have lines similar to these in your nginx configuration file:
log_format gzip '$remote_addr - $remote_user [$time_local] "$request" $status $bytes_sent "$http_referer" "$http_user_agent"';
access_log /spool/logs/nginx-access.log gzip buffer=32k;
The following fields are required for nginx:
$remote_addr | IP address or PTR name of the client making the request |
$time_local | Date and time of log entry |
$request | Request line, including method, URI, and protocol |
$status | HTTP response code (such as 200 or 404) |
$http_user_agent | User agent making the request (such as Googlebot) |
$http_host | Required if you are serving multiple virtual hosts; indicates the virtual host that the request is intended for |
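If your server handles multiple hostnames, note that the default nginx format does not record the host, so you may want a custom format that adds $http_host. The format name keylime_vhost and the log path below are placeholders; a minimal sketch for the http block of your configuration:
log_format keylime_vhost '$http_host $remote_addr - $remote_user [$time_local] '
                         '"$request" $status $body_bytes_sent '
                         '"$http_referer" "$http_user_agent"';
access_log /var/log/nginx/access.log keylime_vhost;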
Uploading Log Files to Keylime Toolbox
When we’ve notified you that you have access to your AWS S3 bucket, upload a test file (as described in this section) then send an email to logs@keylimetoolbox.com to let us know that it’s there. We’ll make sure Keylime Toolbox can process the file correctly and make any needed configuration adjustments. We’ll let you know when we’re done. After that, you can set up a script for automatic uploads.
Detailed instructions using several S3 tools follow. The high level steps are:
- Upload the log files to the Keylime Toolbox AWS S3 bucket.
- Give full control of the uploaded files to the “bucket owner” or to the Keylime Toolbox AWS canonical user ID: 0b2b6d7e33c143cff9616ebaaee4c4670db68853a308b382410a1c0bf2ba2ace.
Because you uploaded the files from your account (even though you uploaded them into the Keylime Toolbox account), Keylime Toolbox can’t access the files unless you enable full control.
Tools For Uploading Logs
Some available tools for uploading files to S3 include:
- Command line tools:
  - aws command line tool (cross-platform; Python with Windows installer)
  - s3cmd (cross-platform; Python)
  - s3 Express (Windows)
- API libraries
- User interface tools
In the examples below, yourbucketid-keylime is the name of your Keylime Toolbox AWS S3 bucket, which we email to you. You can upload multiple files, but each file should be gzipped individually.
aws Example
$ aws s3 cp logfile-2017-02-08.gz s3://yourbucketid-keylime/ --acl bucket-owner-full-control
s3cmd Examples
1. Upload Files to the Keylime Toolbox AWS S3 Bucket
$ s3cmd put logfile-2017-02-08.gz s3://yourbucketid-keylime/
2. Grant Keylime Toolbox FULL_CONTROL of the Uploaded File(s)
Keylime Toolbox can only process the uploaded files if you grant us access. To do this, grant FULL_CONTROL to the Keylime Toolbox canonical user ID: 0b2b6d7e33c143cff9616ebaaee4c4670db68853a308b382410a1c0bf2ba2ace.
$ s3cmd --acl-grant=full_control:0b2b6d7e33c143cff9616ebaaee4c4670db68853a308b382410a1c0bf2ba2ace setacl s3://yourbucketid-keylime/logfile-2017-02-08.gz
s3 Express Example
/> put filename-2017-02-08.gz yourbucketid-keylime -cacl:bucket-owner-full-control
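Putting the pieces together, here is a minimal end-to-end sketch that filters, compresses, and uploads one day of data using the aws command line tool. The log path, server label, and GNU date syntax are assumptions; yourbucketid-keylime again stands for the bucket name we send you:
#!/bin/sh
# Filter yesterday's rotated log to bot traffic, gzip it with a dated name,
# and upload it with bucket-owner-full-control so Keylime Toolbox can read it.
LOG=/var/log/apache2/access.log.1
DATE=$(date -d yesterday +%Y-%m-%d)
OUT="bots-web01-${DATE}.log.gz"

grep -iE 'googlebot|bingbot|msnbot|slurp|baidu|yandex|naver|^#Fields:' "$LOG" | gzip -c > "$OUT"
aws s3 cp "$OUT" "s3://yourbucketid-keylime/" --acl bucket-owner-full-control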