Solr is an Apache project that can be described in many ways. Some like to see it as an open source enterprise search platform rivaling well-established commercial offerings like Autonomy, Fast, and Dieselpoint. But it's very hard to get a good grasp of what Solr is and what it does from this definition, especially as the phrase “Enterprise Search Platform” means different things to different people. I like to view Solr as an attempt to gather all the experience people have accumulated over the years developing search solutions with Lucene, and to build a search platform based on that experience and all the known (and unknown) best practices.
NOTE: For those who are not familiar with Lucene, it is a low-level IR (Information Retrieval) library implemented in Java. With Lucene you can index textual data and execute free text searches on it in a highly performant manner.
So what's the purpose of it all? Well, I guess it depends on who you're asking and on the context in which search is applied. For example, if you're a web developer looking to integrate search into a website, Solr can be a perfect fit for you. It can be installed as a standalone server and exposes its search functionality via a REST-like API (which also makes it language independent, so it doesn't matter whether you develop in Java, PHP, .NET or any other preferred language/platform). Also note that the core search functionality that Solr supports out of the box is probably more than enough for what most websites require.
That's all well and good, but Solr is by no means limited to website search. It is not for nothing that it is often compared to Fast, Dieselpoint, and other enterprise search platforms. You can build quite large-scale and complex search solutions with it, and the list of high-profile companies (CNet, AOL, Digg, and more) already using it to power their search requirements is a testament to that.
GETTING SOLR UP AND RUNNING
You can get Solr up and running in practically no time. You'll first need to download it from the following site: http://www.apache.org/dyn/closer.cgi/lucene/solr.
Once downloaded, you can extract the compressed archive (zip or gzip, depending on your platform) anywhere you want on your file system. I would now like you to pay attention to two folders in the extracted directory – dist and example.
As I already mentioned, Solr is essentially a search server implemented in Java. It is actually a standard web application that can be deployed in a normal servlet container such as Tomcat. In the dist folder, you can find the war file for this server (along with several other jar files which serve different purposes... well... I'll cover them in a later post). But as you probably know, it's always a bit of a hassle to work with war files – you first need to set up a servlet container, then you need to set a few environment variables, then deploy the war, etc., etc., etc... Luckily, the developers of Solr acknowledged that and decided to make it even easier for you to get started, hence the example folder. This folder contains a Solr distribution bundled with a Jetty server. Here is the layout of the example folder:
Now, all you need to do to start the Solr server is run the following command from this folder:
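With a Solr 1.x-style distribution and the bundled Jetty launcher, the command is typically:

```shell
cd example
java -jar start.jar
```

Once Jetty is up, the Solr admin console should be reachable in your browser at http://localhost:8983/solr/admin (8983 being Jetty's default port in the example setup).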
Congrats! You are now running a Solr server.
INTERACTING WITH SOLR
Now that you have Solr up and running, it's time to do something with it. As a search service, the two main operations that Solr supports are indexing and searching. First, you send Solr data to index, after which you'll be able to perform free text searches on this data. But what is this data and what does it look like?
A world of Documents
In Java, we're used to modeling the world in terms of objects and properties. In the IR world, the world is modeled as Documents and fields. A Document represents a unit of data and is made up of one or more fields. A field is a simple text-based name-value pair which holds the actual data. For example, a web page can be represented as a document with 3 fields – URL, title, and body:
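Using Solr's XML document format, such a web page document might look something like this (the field names here are illustrative):

```xml
<doc>
  <field name="url">http://example.com/page.html</field>
  <field name="title">An Example Page</field>
  <field name="body">The full text of the page goes here...</field>
</doc>
```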
And here's how you can model a person as a Document:
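A hypothetical person document, again with illustrative field names, could be:

```xml
<doc>
  <field name="name">john</field>
  <field name="age">20</field>
</doc>
```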
Indexing Documents
Now that you understand how data is represented in Solr, it is time to send Solr a few documents to be indexed. For that, we'll use yet another useful tool that Solr ships with. In the example/exampledocs directory you will find several XML files and a post.jar file. The latter is a tool which can post document files to Solr to be indexed. If you open one of the XML files, you'll see that each file actually holds an XML structure that represents an “add” command. The post.jar tool accepts a list of files as an argument and sends these files to Solr using an HTTP POST request to a dedicated “update” URL. Make sure that Solr is running, and execute the following command:
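From the example/exampledocs directory, the tool is typically invoked like this:

```shell
java -jar post.jar *.xml
```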
When executed, this command will send all the documents in all the XML files to Solr. After they've all been sent, a “commit” command is sent, which makes these documents available for search.
NOTE: without committing, the documents will still be indexed, but they will not be available for search until either a “commit” is executed or the Solr server is restarted. Luckily, the post.jar tool sends a “commit” request automatically after sending the documents.
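If you ever index documents some other way, you can issue a commit manually by posting a `<commit/>` command to the update URL – for example (assuming the default port and URL layout of the example setup):

```shell
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary "<commit/>"
```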
Searching for documents
We managed to index some documents in Solr; all that is left now is to search for them. Just like Solr has a dedicated URL for indexing documents, it also has a dedicated URL for searching for documents (actually, there can be more than one such URL, but I won't get into that right now). By default, the search URL is:
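With the default example setup (Jetty on port 8983), a “match all” query looks like this:

```
http://localhost:8983/solr/select?q=*:*
```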
The result returned from Solr for this query is an XML document containing some metadata about the request and query execution (e.g. how long the search took), as well as a list of the matched documents (also referred to as “search hits”). Notice that it's only a partial list of the documents – while the “numFound” attribute of the <result> element shows that 26 documents were found, only 10 are actually returned. There's a very good reason for that – as Solr is designed to index millions of documents, it makes very little sense to return such large search results at once. Therefore, Solr returns the results one “page” at a time. By default the page size is 10, and if not specified otherwise the first page is returned (that is, the first 10 documents). You can control this behavior by providing 2 extra parameters – rows (determines the page size) and start (determines the zero-based index of the first document in the page):
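For example, to fetch the second page of 10 results (documents 10–19), the request would look something like:

```
http://localhost:8983/solr/select?q=*:*&start=10&rows=10
```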
Understanding and controlling the returned result
As you can see in the returned XML, each document in the result is returned with its fields. When configuring Solr, one can specify the schema of the index. I will not go into it right now, but in general, the schema determines what fields (name and type) a document is expected to have and also how Solr should handle them. Fields can be handled in 3 ways:
- Indexed - the field value is broken into tokens which are filtered and indexed, making it possible to search on it
- Stored - the field value is stored as a whole, so that when the document is read from the index the original field value can be restored
- Indexed and Stored - the two combined
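In schema.xml, these behaviors are controlled per field with the indexed and stored attributes – a sketch (the field name and type here are illustrative):

```xml
<field name="title" type="text" indexed="true" stored="true"/>
```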
Sorting
By default, the search hits are sorted by their score in descending order (highest score first). It is, however, possible to sort based on other field values. To do this, you can add the [sort] parameter to the request. This parameter can hold a comma-separated list of sort “specifications”, where each specification defines the field to sort on and the direction of the sort (ascending/descending). When multiple sort specifications are set, the search result will be sorted on each specification in turn. For example, let's say the following request is sent:
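A multi-specification sort request might look like this (assuming the documents have an age field, as in the hypothetical person document above):

```
http://localhost:8983/solr/select?q=*:*&sort=age asc, score desc
```

Here the hits are first sorted by age in ascending order, and documents with equal age are then sorted by score in descending order.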
Query Syntax
So far, we've seen two types of queries – free text queries (like the “monitor” in the example above) and the “match all” query (*:*). The truth is that Solr supports a much richer query syntax than this. Here are a few examples of more advanced search queries:
- q="name:john" - perform a free text search on a specific field
- q="age:[0 TO 20]" - perform a range query on the [age] field
- q="name:john AND age:[0 TO 20]" - compose queries with boolean constructs (AND, OR, NOT)
SolrJ
Up until now, we used the browser to interact with Solr. But it is more likely that you'll be using Solr's services from another application. To simplify this type of communication, client libraries were developed for the most common development languages. SolrJ is such a library, developed in Java (you can find the SolrJ jar in the dist directory, as described above). Here's a small snippet of code that demonstrates how you can use SolrJ to connect to a Solr server, index documents, and search on them:
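A minimal sketch using the SolrJ 1.x API – it assumes a server running at the default http://localhost:8983/solr, and reuses the hypothetical name field from the person document above:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrJExample {
    public static void main(String[] args) throws Exception {
        // connect to the Solr server
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // index a document
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "person-1");
        doc.addField("name", "john");
        server.add(doc);
        server.commit(); // make the document available for search

        // search for it
        SolrQuery query = new SolrQuery("name:john");
        QueryResponse response = server.query(query);
        for (SolrDocument hit : response.getResults()) {
            System.out.println(hit.getFieldValue("id"));
        }
    }
}
```

Note that the id field and its value are illustrative – the actual required fields depend on the schema your Solr instance is configured with.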