For Caplin Hack Day 2018, ‘Power to the People’, I set myself the challenge of empowering our users by improving our website’s internal search engine.
Background
Our website search is powered by a custom solution based on AWS CloudSearch, an Amazon-hosted enterprise search platform.
When we migrated to CloudSearch, we chose the open-source Apache Nutch web crawler to populate our CloudSearch Domain. Apache Nutch provided an easy migration path to AWS CloudSearch, but it was a heavyweight solution for a simple task.
Hack Day goals
For Hack Day 2018, I wanted to replace Apache Nutch with a custom script optimised for our use case. I set myself three goals:
- Faster execution time: the new script could improve on Apache Nutch’s execution time by reading our website’s sitemap instead of crawling the website for links.
- Easier to deploy and schedule: I wanted to use Jenkins to schedule the new script in ephemeral containers in Kubernetes. This is much harder to achieve with Apache Nutch, which maintains a representation of the website in a local Hadoop database.
- Higher quality search results: the new script should give me greater control over which pages are indexed and how content from those pages is extracted. For example, I wanted to try extracting page headings into a separate field and boosting page-rank in search results where search terms appear in headings.
Hack Day
I divided the project into four stages:
- Create a new CloudSearch domain.
- Identify CloudSearch documents to update and to delete.
- Download updated pages from the website and extract content from them.
- Submit batches of upload and deletion operations to CloudSearch.
I decided to write the new indexer in Ruby. I had written some Jekyll plugins in Ruby, and I was keen to learn more about the language.
Creating the CloudSearch domain
I designed a simple schema for the CloudSearch domain comprising five fields: title, content, tstamp (timestamp), url, and headings.
I scripted the creation of the domain using the create-domain and define-index-field commands of the AWS CLI:
#!/bin/bash
DOMAIN=$1

aws cloudsearch create-domain --domain-name ${DOMAIN}

aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name content --type text --sort-enabled false
aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name title --type text --sort-enabled false
aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name tstamp --type date --sort-enabled false --facet-enabled false
aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name url --type literal --sort-enabled false --facet-enabled false
aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name headings --type text --sort-enabled false
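One point worth noting: if the index fields of a domain change after documents have been uploaded, CloudSearch does not apply the new indexing options until the index is rebuilt, which the AWS CLI exposes as the index-documents command:
aws cloudsearch index-documents --domain-name ${DOMAIN}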
Identifying documents to upload and documents to delete
I wanted my script to be stateless, rather than follow Apache Nutch’s approach of storing a representation of the website locally. My new script identified CloudSearch documents to update and delete by comparing URLs and timestamps on our website with URLs and timestamps in CloudSearch.
I created two Hash collections of timestamps keyed by URL: one for all the documents in our CloudSearch domain and one for all the pages on the Caplin website. From these two collections, I could derive subsets of URLs to upload and delete.
To build the Hash of URLs and timestamps in CloudSearch, I ran a (matchall) query against the CloudSearch domain using the AWS Ruby SDK:
require 'time'

# cloudsearch_domain_client is an Aws::CloudSearchDomain::Client,
# configured with the domain's search endpoint.
search_options = {
  query: '(matchall)',
  query_parser: 'structured',
  return: 'url,tstamp',
  size: 200,
  cursor: 'initial'
}

# Page through every document in the domain using cursor-based pagination
cloudsearch_urls = {}
response = cloudsearch_domain_client.search(search_options)
until response.hits.hit.empty?
  response.hits.hit.each do |hit|
    cloudsearch_urls[hit.fields['url'][0]] = Time.parse(hit.fields['tstamp'][0])
  end
  search_options[:cursor] = response.hits.cursor
  response = cloudsearch_domain_client.search(search_options)
end
To build the Hash of URLs and timestamps in the website, I used Ruby’s Nokogiri XML parser to load the Caplin website sitemap and extract the values of <loc> and <lastmod> XML elements:
require 'nokogiri'
require 'open-uri'
require 'time'

website_urls = {}
doc = Nokogiri::XML(open(url))
doc.remove_namespaces!

urls = doc.xpath('.//url')
urls.each do |node|
  locs = node.xpath('./loc')
  unless locs.empty?
    lastmods = node.xpath('./lastmod')
    loc = locs[0].text
    # Fall back to the current time when the sitemap entry has no <lastmod>
    website_urls[loc] = lastmods.empty? ? Time.now : Time.parse(lastmods[0].text)
  end
end
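For reference, the script expects each <url> entry in the sitemap to follow the standard sitemap protocol. A typical entry looks something like this (the URL and date are made up for illustration):
<url>
  <loc>https://www.caplin.com/developer/some-page</loc>
  <lastmod>2018-06-15T09:30:00+00:00</lastmod>
</url>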
From the two Hash objects, cloudsearch_urls and website_urls, I could then derive two subset Hash objects:
- URLs in the website Hash that are either not in the CloudSearch Hash or are in the CloudSearch Hash but with an older timestamp:
  urls_to_upload = website_urls.select { |key, value| !cloudsearch_urls.has_key?(key) || cloudsearch_urls[key] < value }
- URLs in the CloudSearch Hash that are not in the website Hash:
  urls_to_delete = cloudsearch_urls.select { |key, value| !website_urls.has_key?(key) }
Extracting content from web pages
To extract content from our web pages, I used Ruby’s Nokogiri XML parser.
First I loaded the page into the parser and stripped unwanted HTML elements from the page:
# Convert older pages into UTF-8 for Nokogiri
if text.encoding.name == "ASCII-8BIT"
  text = text.encode('UTF-8', 'ISO-8859-1')
end

# Load the page into Nokogiri
doc = Nokogiri::HTML(text)

# Simplify page processing by stripping namespaces from the page
doc.remove_namespaces!

# Remove unwanted elements from the page
selector = './/header|.//nav|.//footer|.//script|.//style'
doc.xpath(selector).each { |node| node.remove }
Next, I extracted content and headings:
content_node = doc.at_xpath('.//article') || doc.at_xpath('.//body')
content = content_node.text.gsub(/\s{2,}/, ' ').strip

headings = ''
heading_nodes = content_node.xpath('.//h1|.//h2|.//h3|.//h4|.//h5|.//h6')
unless heading_nodes.empty?
  heading_nodes.each do |node|
    headings = headings + ' ' + node.text
  end
  headings = headings.gsub(/\s{2,}/, ' ').strip
end
Uploading and deleting documents in CloudSearch
To upload and delete documents in our CloudSearch domain, I used the AWS Ruby SDK’s CloudSearchDomainClient.upload_documents method. The method takes a single parameter: a Hash object that contains a JSON or XML description of a batch of upload and deletion operations.
The description of a single operation cannot be greater than 1MB in size and the description of a batch of operations cannot be greater than 5MB. Amazon charge by the batch, not by the operation, so there is an incentive to include as many operations in a batch as possible. For more information on the JSON and XML description format, see Preparing Your Data in the Amazon CloudSearch Developer Guide.
I chose to describe batches in JSON. Ruby includes a JSON module that allows for easy conversion of Ruby arrays and Hash objects to JSON.
I modelled each operation as a Ruby Hash object, and a batch of operations as an array of Hash objects. On addition of the final operation to the array, I submitted the batch to CloudSearch:
cloudsearch_domain_client.upload_documents({ documents: batch.to_json, content_type: 'application/json' })
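For illustration, each operation in the batch is a Hash in the format described in Preparing Your Data: an add operation carries the document fields, and a delete operation only needs the document ID. How the document ID is derived is an assumption in this sketch (a SHA-1 hash of the URL); the field variables hold the values extracted earlier:
require 'digest'
require 'time'

# An upload ('add') operation; the SHA-1 document ID is an assumption, since
# CloudSearch only requires IDs to be unique and to use its permitted characters
upload_operation = {
  type: 'add',
  id: Digest::SHA1.hexdigest(url),
  fields: {
    title: title,
    content: content,
    headings: headings,
    url: url,
    tstamp: tstamp.utc.iso8601   # CloudSearch date fields take ISO 8601 values
  }
}

# A deletion operation only identifies the document to remove
delete_operation = {
  type: 'delete',
  id: Digest::SHA1.hexdigest(url)
}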
To enforce the 5MB batch size limit, I tested the size of the batch array after adding an operation to it. If the array, converted to JSON, was greater than 5MB in size (batch.to_json.bytesize > 5242880), I removed the last operation from the array (batch.pop), submitted the batch to CloudSearch, reset the array (batch.clear), and re-added the operation to the array (batch << operation).
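Putting that together, the batching loop looks roughly like the sketch below. It is a simplified reconstruction rather than the production code; operations stands for the combined list of upload and deletion operations derived earlier:
require 'json'

MAX_BATCH_BYTES = 5 * 1024 * 1024   # CloudSearch's 5MB limit on a batch of operations

batch = []
operations.each do |operation|
  batch << operation
  next unless batch.to_json.bytesize > MAX_BATCH_BYTES
  batch.pop                          # remove the operation that pushed the batch over the limit
  cloudsearch_domain_client.upload_documents(
    { documents: batch.to_json, content_type: 'application/json' }
  )
  batch.clear                        # start a new batch
  batch << operation                 # re-add the operation to the new batch
end

# Submit the final, partially filled batch
unless batch.empty?
  cloudsearch_domain_client.upload_documents(
    { documents: batch.to_json, content_type: 'application/json' }
  )
end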
Conclusion
My project was a success and I was able to put the new script into production soon after Hack Day.
I achieved all three goals:
- Faster: The new script builds a full index in under ten minutes, and incremental updates are much faster. Apache Nutch took several hours to run.
- Easier to deploy: Without local state, the new script is very easy to deploy. I added a Jenkinsfile to the project and scheduled Jenkins to run the script inside a Kubernetes-managed container.
- Better search results: The addition of the headings field improves the relevance of search results by boosting page-rank for pages in which search terms appear in headings. One way to apply such a boost at query time is sketched below.
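One way to apply the boost at query time is to pass field weights to CloudSearch’s simple query parser through the search request’s query_options parameter. The query term and the weights in this sketch are illustrative rather than the values used in production:
# A sketch of query-time field weighting using the simple query parser
response = cloudsearch_domain_client.search(
  query: 'websocket',
  query_parser: 'simple',
  query_options: '{"fields":["headings^3","title^2","content"]}',
  return: 'title,url',
  size: 10
)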