The Apache Software Foundation's widely used open source Lucene/Solr search engine package has been upgraded to accommodate its users' seemingly insatiable need to collect and use ever-larger amounts of data.
"The biggest improvement that has happened to Lucene/Solr is scalability," said Sarath Jarugula, vice president of product management at LucidWorks, which offers a commercially supported version of Lucene/Solr. "Lucene/Solr has been re-architected to index data across hundreds of servers," he said.
The keepers of the project plan to release Lucene/Solr 4.0 within the next day or so. Version 4.0 has been three years in the making.
While IT professionals may not have heard of Lucene or Solr, many probably have used these technologies at some point, as the software is embedded in a number of enterprise search products. Many e-commerce and social media sites, such as Facebook and Twitter, also use Lucene/Solr to power their search services.
Doug Cutting, who also created the Apache Hadoop data processing platform, built Lucene as a full-text search engine based on Java. While Lucene is a Java library of search functions, Solr provides an API (application programming interface) so other applications can interface with Lucene. Although Lucene and Solr started as separate projects, the two were merged into a single entity in 2010, now called Apache Lucene/Solr.
This new update reflects how organizations are ingesting and reusing more and more data.
Ten years ago, Jarugula noted, larger organizations might have stored a few million electronic documents, which collectively took up several hundred gigabytes. These days, however, such repositories have ballooned in size: It is not uncommon for Jarugula to encounter organizations that generate a terabyte of data a day.
Lucene/Solr has been updated to handle such larger workloads.
Most significantly, the Solr component includes a new technique called distributed indexing, which divides document indexing duties across multiple servers to speed response time even as the data sets grow larger. To further speed operations, Solr now can spawn multiple threads to index material, with each thread being able to write to disk concurrently.
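The general idea behind distributed indexing can be illustrated with a minimal Python sketch. This is not Solr's implementation: the names `shard_for`, `build_shard_index`, `index_documents`, and `search`, the hash-based routing rule, and the four-shard cluster size are all assumptions made for illustration. Each document is routed to a shard by a stable hash of its ID, every shard builds its own small index in a separate thread, and a query gathers results from all shards.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # hypothetical cluster size, chosen for illustration


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Map a document ID to a shard number using a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def build_shard_index(docs):
    """Build a tiny inverted index (term -> set of doc IDs) for one shard."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def index_documents(docs, num_shards=NUM_SHARDS):
    """Partition documents by shard, then index each partition in its own thread."""
    partitions = [[] for _ in range(num_shards)]
    for doc_id, text in docs:
        partitions[shard_for(doc_id, num_shards)].append((doc_id, text))
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return list(pool.map(build_shard_index, partitions))


def search(shard_indexes, term):
    """Ask every shard for the term and merge the matching doc IDs."""
    hits = set()
    for index in shard_indexes:
        hits |= index.get(term.lower(), set())
    return hits


docs = [
    ("a", "Apache Lucene search"),
    ("b", "Solr search server"),
    ("c", "ZooKeeper cluster"),
]
shards = index_documents(docs)
matches = search(shards, "search")
```

In a real deployment each shard would sit on its own server and the threads would write index segments to disk. The sketch only shows how the work is split at indexing time and merged at query time.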
The software can now also recognize when it operates in a clustered server environment and adjust its behavior accordingly. This set of technologies comes under the name SolrCloud. "If you have a cluster, Solr will know when any server goes down and will watch for when it comes back up," Jarugula said. To help with these duties, Lucene/Solr uses the Apache ZooKeeper cluster configuration management software.
Distributed indexing also shortens the time it takes for newly indexed material to become available to users, which paves the way for real-time search. Typically, enterprise search engines update their indices only once a day, or once every few hours. Lucene can now update continuously, even with a data set of billions of documents. "You can now index on a per-second basis," Jarugula said.