For Caplin Hack Day 2018, ‘Power to the People’, I set myself the challenge of empowering our users by improving our website’s internal search engine.
Background
Our website search is powered by a custom solution based on AWS CloudSearch, an Amazon-hosted enterprise search platform.
When we migrated to CloudSearch, we chose the open-source Apache Nutch web crawler to populate our CloudSearch Domain. Apache Nutch provided an easy migration path to AWS CloudSearch, but it was a heavyweight solution for a simple task.
Hack Day goals
For Hack Day 2018, I wanted to replace Apache Nutch with a custom script optimised for our use case. I set myself three goals:
- Faster execution time: the new script could improve on Apache Nutch’s execution time by reading our website’s sitemap instead of crawling the website for links.
- Easier to deploy and schedule: I wanted to use Jenkins to schedule the new script in ephemeral containers in Kubernetes. This is much harder to achieve with Apache Nutch, which maintains a representation of the website in a local Hadoop database.
- Higher quality search results: the new script should give me greater control over which pages are indexed and how content from those pages is extracted. For example, I wanted to try extracting page headings into a separate field and boosting page-rank in search results where search terms appear in headings.
Hack Day
I divided the project into four stages:
- Create a new CloudSearch domain.
- Identify CloudSearch documents to update and to delete.
- Download updated pages from the website and extract content from them.
- Submit batches of upload and deletion operations to CloudSearch.
I decided to write the new indexer in Ruby. I had written some Jekyll plugins in Ruby, and I was keen to learn more about the language.
Creating the CloudSearch domain
I designed a simple schema for the CloudSearch domain comprising five fields: title, content, tstamp (timestamp), url, and headings.
I scripted the creation of the domain using the create-domain and define-index-field commands of the AWS CLI:
#!/bin/bash
DOMAIN=$1

aws cloudsearch create-domain --domain-name ${DOMAIN}

aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name content --type text --sort-enabled false
aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name title --type text --sort-enabled false
aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name tstamp --type date --sort-enabled false --facet-enabled false
aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name url --type literal --sort-enabled false --facet-enabled false
aws cloudsearch define-index-field --domain-name ${DOMAIN} \
    --name headings --type text --sort-enabled false
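One point worth noting: if the index fields of a domain change after documents have been uploaded, CloudSearch does not apply the new indexing options until the index is rebuilt, which the AWS CLI exposes as the index-documents command:
aws cloudsearch index-documents --domain-name ${DOMAIN}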
Identifying documents to upload and documents to delete
I wanted my script to be stateless, rather than follow Apache Nutch’s approach of storing a representation of the website locally. My new script identified CloudSearch documents to update and delete by comparing URLs and timestamps on our website with URLs and timestamps in CloudSearch.
I created two Hash collections of timestamps keyed by URL: one for all the documents in our CloudSearch domain and one for all the pages on the Caplin website. From these two collections, I could derive subsets of URLs to upload and delete.
To build the Hash of URLs and timestamps in CloudSearch, I ran a (matchall) query against the CloudSearch domain using the AWS Ruby SDK:
require 'time'

# cloudsearch_domain_client is an Aws::CloudSearchDomain::Client,
# configured with the domain's search endpoint.
search_options = {
  query: '(matchall)',
  query_parser: 'structured',
  return: 'url,tstamp',
  size: 200,
  cursor: 'initial'
}

# Page through every document in the domain using cursor-based pagination
cloudsearch_urls = {}
response = cloudsearch_domain_client.search(search_options)
until response.hits.hit.empty?
  response.hits.hit.each do |hit|
    cloudsearch_urls[hit.fields['url'][0]] = Time.parse(hit.fields['tstamp'][0])
  end
  search_options[:cursor] = response.hits.cursor
  response = cloudsearch_domain_client.search(search_options)
end
To build the Hash of URLs and timestamps in the website, I used Ruby’s Nokogiri XML parser to load the Caplin website sitemap and extract the values of <loc> and <lastmod> XML elements:
require 'nokogiri'
require 'open-uri'
require 'time'

website_urls = {}
doc = Nokogiri::XML(open(url))
doc.remove_namespaces!

urls = doc.xpath('.//url')
urls.each do |node|
  locs = node.xpath('./loc')
  unless locs.empty?
    lastmods = node.xpath('./lastmod')
    loc = locs[0].text
    # Fall back to the current time when the sitemap entry has no <lastmod>
    website_urls[loc] = lastmods.empty? ? Time.now : Time.parse(lastmods[0].text)
  end
end
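For reference, the script expects each <url> entry in the sitemap to follow the standard sitemap protocol. A typical entry looks something like this (the URL and date are made up for illustration):
<url>
  <loc>https://www.caplin.com/developer/some-page</loc>
  <lastmod>2018-06-15T09:30:00+00:00</lastmod>
</url>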
From the two Hash objects, cloudsearch_urls and website_urls, I could then derive two subset Hash objects:
- URLs in the website Hash that are either not in the CloudSearch Hash or are in the CloudSearch Hash but with an older timestamp:
  urls_to_upload = website_urls.select { |key, value| !cloudsearch_urls.has_key?(key) || cloudsearch_urls[key] < value }
- URLs in the CloudSearch Hash that are not in the website Hash:
  urls_to_delete = cloudsearch_urls.select { |key, value| !website_urls.has_key?(key) }
Extracting content from web pages
To extract content from our web pages, I used Ruby’s Nokogiri XML parser.
First I loaded the page into the parser and stripped unwanted HTML elements from the page:
# Convert older pages into UTF-8 for Nokogiri
if text.encoding.name == "ASCII-8BIT"
  text = text.encode('UTF-8', 'ISO-8859-1')
end

# Load the page into Nokogiri
doc = Nokogiri::HTML(text)

# Simplify page processing by stripping namespaces from the page
doc.remove_namespaces!

# Remove unwanted elements from the page
selector = './/header|.//nav|.//footer|.//script|.//style'
doc.xpath(selector).each { |node| node.remove }
Next, I extracted content and headings:
content_node = doc.at_xpath('.//article') || doc.at_xpath('.//body')
content = content_node.text.gsub(/\s{2,}/, ' ').strip

headings = ''
heading_nodes = content_node.xpath('.//h1|.//h2|.//h3|.//h4|.//h5|.//h6')
unless heading_nodes.empty?
  heading_nodes.each do |node|
    headings = headings + ' ' + node.text
  end
  headings = headings.gsub(/\s{2,}/, ' ').strip
end
Uploading and deleting documents in CloudSearch
To upload and delete documents in our CloudSearch domain, I used the AWS Ruby SDK’s CloudSearchDomainClient.upload_documents method. The method takes a single parameter: a Hash object that contains a JSON or XML description of a batch of upload and deletion operations.
The description of a single operation cannot be greater than 1MB in size and the description of a batch of operations cannot be greater than 5MB. Amazon charge by the batch, not by the operation, so there is an incentive to include as many operations in a batch as possible. For more information on the JSON and XML description format, see Preparing Your Data in the Amazon CloudSearch Developer Guide.
I chose to describe batches in JSON. Ruby includes a JSON module that allows for easy conversion of Ruby arrays and Hash objects to JSON.
I modelled each operation as a Ruby Hash object, and a batch of operations as an array of Hash objects. On addition of the final operation to the array, I submitted the batch to CloudSearch:
cloudsearch_domain_client.upload_documents({ documents: batch.to_json, content_type: 'application/json' })
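For illustration, each operation in the batch is a Hash in the format described in Preparing Your Data: an add operation carries the document fields, and a delete operation only needs the document ID. How the document ID is derived is an assumption in this sketch (a SHA-1 hash of the URL); the field variables hold the values extracted earlier:
require 'digest'
require 'time'

# An upload ('add') operation; the SHA-1 document ID is an assumption, since
# CloudSearch only requires IDs to be unique and to use its permitted characters
upload_operation = {
  type: 'add',
  id: Digest::SHA1.hexdigest(url),
  fields: {
    title: title,
    content: content,
    headings: headings,
    url: url,
    tstamp: tstamp.utc.iso8601   # CloudSearch date fields take ISO 8601 values
  }
}

# A deletion operation only identifies the document to remove
delete_operation = {
  type: 'delete',
  id: Digest::SHA1.hexdigest(url)
}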
To enforce the 5MB batch size limit, I tested the size of the batch array after adding an operation to it. If the array, converted to JSON, was greater than 5MB in size (batch.to_json.bytesize > 5242880), I removed the last operation from the array (batch.pop), submitted the batch to CloudSearch, reset the array (batch.clear), and re-added the operation to the array (batch << operation).
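Putting that together, the batching loop looks roughly like the sketch below. It is a simplified reconstruction rather than the production code; operations stands for the combined list of upload and deletion operations derived earlier:
require 'json'

MAX_BATCH_BYTES = 5 * 1024 * 1024   # CloudSearch's 5MB limit on a batch of operations

batch = []
operations.each do |operation|
  batch << operation
  next unless batch.to_json.bytesize > MAX_BATCH_BYTES
  batch.pop                          # remove the operation that pushed the batch over the limit
  cloudsearch_domain_client.upload_documents(
    { documents: batch.to_json, content_type: 'application/json' }
  )
  batch.clear                        # start a new batch
  batch << operation                 # re-add the operation to the new batch
end

# Submit the final, partially filled batch
unless batch.empty?
  cloudsearch_domain_client.upload_documents(
    { documents: batch.to_json, content_type: 'application/json' }
  )
end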
Conclusion
My project was a success and I was able to put the new script into production soon after Hack Day.
I achieved all three goals:
- Faster: The new script builds a full index in under ten minutes, and incremental updates are much faster. Apache Nutch took several hours to run.
- Easier to deploy: Without local state, the new script is very easy to deploy. I added a Jenkinsfile to the project and scheduled Jenkins to run the script inside a Kubernetes-managed container.
- Better search results: The addition of the headings field improves the relevance of search results by boosting page-rank for pages in which search terms appear in headings. One way to apply such a boost at query time is sketched below.
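One way to apply the boost at query time is to pass field weights to CloudSearch’s simple query parser through the search request’s query_options parameter. The query term and the weights in this sketch are illustrative rather than the values used in production:
# A sketch of query-time field weighting using the simple query parser
response = cloudsearch_domain_client.search(
  query: 'websocket',
  query_parser: 'simple',
  query_options: '{"fields":["headings^3","title^2","content"]}',
  return: 'title,url',
  size: 10
)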