Creating an Easy to Manage Search Engine with Catalyst and ElasticSearch

Overview

ElasticSearch is a search engine based on Lucene, with a number of really cool features that, in my opinion, elevate it above many other search engines.

For instance, it's schema-less. Some would argue that's a bad thing, but the way things are indexed in ElasticSearch (indexed "things" are called documents) allows the user to create a sort of per-document schema, much like you would with MongoDB or other document-based storage engines. It also has an "autodiscovery" feature for other ElasticSearch instances on the network: all you have to do is run bin/elasticsearch on the machines you want to cluster and, poof, you have a distributed and fault-tolerant index.

So! Let's get into some code and setup.

Getting ElasticSearch

Step 1

Download your desired version and build of ElasticSearch from http://www.elasticsearch.com/download/ (you can also build from source).

Step 2

Decompress (or build) ElasticSearch into your desired location. It's really not important where you do this, but /opt/elasticsearch is where I put mine.

Step 3

Start your instances by typing bin/elasticsearch in the root directory where you decompressed ElasticSearch. You can also run with the -f switch to have it run in the foreground and spit out debug information.

A Simple API Introduction

The ElasticSearch CPAN module is the Perl binding to the ElasticSearch REST API, and is written (marvelously) by Clinton Gormley. It has a few key methods we will be using in this article.

new

Creates your connection to your ElasticSearch instance(s).

index

Indexes your data. Takes an index name, a document id (unique, autogenerated if you leave it out), and your data, which should be in the form of a hashref.

search

Searches your indexed data. Takes an index name, a document type (you can type your documents when you index them: for instance, a document that is an email, or a tweet), and your query string. There are a number of search options you can use to query your data, but the one we'll use here is the field query.

Okay. So that's a basic ElasticSearch API. There are plenty of examples on the site you can check out if you feel you need to grok this more thoroughly.
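
To make this concrete, here is a minimal sketch of those three calls together, matching the way we'll use them later in this article (the index name and document fields are just placeholders):

    use ElasticSearch;

    # new: connect to a local ElasticSearch instance.
    my $es = ElasticSearch->new(
        servers   => 'localhost:9200',
        transport => 'httplite',
    );

    # index: store a document; the id is autogenerated since we leave it out.
    $es->index(
        index => 'myindex',
        type  => 'post',
        data  => { title => 'Hello', body => 'My first document' },
    );

    # search: run a field query against everything in 'myindex'.
    my $results = $es->search(
        index => 'myindex',
        type  => 'post',
        query => { field => { _all => 'hello' } },
    );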

Next, we figure out how to tie this thing to Catalyst.

Catalyst::Model

We will be creating a small standalone Search class, plus a thin Catalyst model to hook it up to our Catalyst application.

Code:

Search.pm:

    package Search;

    use Moose;
    use namespace::autoclean;
    use ElasticSearch;

    # A lazily-constructed ElasticSearch client, shared by the methods below.
    has 'es_object' => (
        is       => 'ro',
        isa      => 'ElasticSearch',
        required => 1,
        lazy     => 1,
        default  => sub {
            ElasticSearch->new(
                servers     => 'localhost:9200',
                transport   => 'httplite',
                trace_calls => 'log_file',
            );
        },
    );

    # Index one document: takes an index name, a document type, and a hashref of data.
    sub index_data {
        my ($self, %p) = @_;
        return $self->es_object->index(
            index => $p{'index'},
            type  => $p{'type'},
            data  => $p{'data'},
        );
    }

    # Run a field query across all fields of the given index and type.
    sub execute_search {
        my ($self, %p) = @_;
        my $results = $self->es_object->search(
            index => $p{'index'},
            type  => $p{'type'},
            query => {
                field => {
                    _all => $p{'terms'},
                },
            },
        );
        return $results;
    }

    __PACKAGE__->meta->make_immutable;

    1;

MyApp::Model::Search:

    package MyApp::Model::Search;

    use Moose;
    use namespace::autoclean;

    # Inherit the indexing and searching methods from our standalone Search
    # class, and from Catalyst::Model so Catalyst treats this as a model.
    extends 'Search', 'Catalyst::Model';

    sub COMPONENT {
        my ($class, $c, $config) = @_;
        my $self = $class->new(%{ $config });

        return $self;
    }

    __PACKAGE__->meta->make_immutable;

    1;

Okay. So we have the search portion set up. This will be called like my $results = $c->model('Search')->execute_search(%opts) from inside our application.

The next step is to set up an indexer. My example uses DBIx::Class as the source of data to index, as that's what I originally wrote all this for. However, you can use an arbitrary data source as long as you can break it up into the bits that ElasticSearch needs.

The script:

    use strict;
    use warnings;

    use Data::Dumper;
    use Search;
    use My::Schema;

    my $schema = My::Schema->connect("dbi:Pg:dbname=mydb", "user", "pass");
    my $search_obj = Search->new;
    my $rs = $schema->resultset('Entry')->search({ published => 1 });
    print "Search obj: " . Dumper($search_obj);
    print "Beginning indexing\n";

    while ( my $entry = $rs->next ) {
        print "Indexing " . $entry->title . "\n";
        my @attachments;    # collect any attachment data for this entry here
        my $result = $search_obj->index_data(
            index => 'deimos',
            type  => $entry->type,
            data  => {
                title         => $entry->title,
                display_title => $entry->display_title,
                author        => $entry->author->name,
                created       => $entry->created_at . "",    # stringify DateTime objects
                updated       => $entry->updated_at . "",
                body          => $entry->body,
                attachments   => \@attachments,
            },
        );
    }

That is a basic script to get our data indexed. To confirm, we can run a few cURL searches:

    curl -XGET 'http://127.0.0.1:9200/_all/_search' -d '
    {
       "query" : {
          "field" : {
             "_all" : "your search terms that you know will get you a document returned"
          }
       }
    }'

This will return something like:

    {"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":4,"max_score":0.24368995,
     "hits":[{"_index":"ourindexdeimos","_type":"post","_id":"l_3Jw9PkRz2arFdHO3t5Pg",
     "_score":0.24368995, "_source" : {
     "thingy":"thingy data"
    }}]}}

If you get something looking like that, congrats! Your data indexed properly.
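
You can run the same sanity check through our Search class. A quick sketch, assuming the indexer above has already populated the 'deimos' index:

    use Search;

    my $search_obj = Search->new;
    my $results = $search_obj->execute_search(
        index => 'deimos',
        type  => 'post',
        terms => 'test',
    );

    # The total hit count lives under hits/total in the response.
    print "Found " . $results->{'hits'}{'total'} . " documents\n";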

Executing searches within your application

So here we go, what we all came here for.

Here is the Search controller:

    package MyApp::Controller::Search;
    use Moose;
    use namespace::autoclean;
    BEGIN { extends 'Catalyst::Controller::REST'; }

    sub base : Chained('/') PathPart('') CaptureArgs(0) {
        my ($self, $c) = @_;
        my $data = $c->req->data || $c->req->params;
        my $results = $c->model('Search')->execute_search(
            terms => $data->{'q'},
            index => $data->{'index'} || "default",
            type  => $data->{'type'}  || "post",
        );
        my @results;
        for my $result ( @{ $results->{'hits'}{'hits'} } ) {
            my $r = $result->{'_source'};

            # Trim the body down to a 300-character excerpt.
            my $body = substr($r->{'body'}, 0, 300);
            $body .= "...";
            push @results, {
                display_title => $r->{'display_title'},
                title         => $r->{'title'},
                created       => $r->{'created'},
                updated       => $r->{'updated'},
                author        => $r->{'author'},
                body          => $body,
            };
        }
        $c->stash( results => \@results );
    }

    sub index : Chained('base') PathPart('search') Args(0) ActionClass('REST') {
        my ($self, $c) = @_;
    }

    sub index_GET {
        my ($self, $c) = @_;
        $self->status_ok($c,
            entity => {
                results => $c->stash->{'results'},
            },
        );
    }

    __PACKAGE__->meta->make_immutable;
    1;

And a simple template to display them:

    <h2>Search results for "<strong>[% c.req.param('q') %]</strong>":</h2>
    <ul>
    [% FOR result IN results %]
    <li>
    <div>By [% result.author %]</div>
    <div><a href="[% c.uri_for_action('/url/to/your/document', [ result.title ]) %]">[% result.display_title %]</a></div>
    <div>[% result.body %]</div>
    </li>
    [% END %]
    </ul>
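
Because the controller extends Catalyst::Controller::REST, the same action can also serve JSON directly based on the Accept header. Here's a hypothetical smoke test against it, assuming your app is running on localhost port 3000 under the development server:

    use LWP::UserAgent;
    use JSON;

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get(
        'http://localhost:3000/search?q=test',
        'Accept' => 'application/json',    # let Catalyst::Controller::REST pick the JSON serializer
    );

    # The entity we set in index_GET comes back with a 'results' array.
    my $data = decode_json( $res->decoded_content );
    print scalar( @{ $data->{'results'} } ), " results returned\n";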

And there you go. A very simple, flexible, and relatively fast search engine, with the ability to use any data storage back end for your indexable data.

Parting notes

ElasticSearch is extremely customizable and tuneable. You can get a GREAT deal of performance improvement by playing with the indexing options, ranking algorithms, and storage and request transports. All of this is documented at the ElasticSearch web site.

One final thought: you can take the portion of the indexer code that actually inserts the document into the search index and run it right after the "commit" step of your application's data store. This way, you get virtually instantaneous indexing of each document upon its creation.
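
For example, in a hypothetical controller action that creates an entry, the index call could go right after the database insert. The create action, model name, and columns below are placeholders, not part of the application above:

    sub create : Local {
        my ($self, $c) = @_;
        my $params = $c->req->params;

        # The "commit": create the row in the database first.
        my $entry = $c->model('DB::Entry')->create({
            title => $params->{'title'},
            body  => $params->{'body'},
        });

        # Then index it immediately, so it is searchable as soon as it exists.
        $c->model('Search')->index_data(
            index => 'deimos',
            type  => 'post',
            data  => {
                title => $entry->title,
                body  => $entry->body,
            },
        );
    }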

Enjoy folks, I hope you find this as useful as I did!

AUTHOR

--Devin Austin, <dhoss@cpan.org>

Created using Catalyst 5.80029 on a MacBook Pro, with Perl 5.12.0.