Book Review: Scaling Apache Solr

First time accepted submitter sobczakt writes We live in a world flooded by data and information, and we all realize that if we can't find what we're looking for (e.g. a specific document), there's no benefit from all these data stores. When your data sets become enormous or your systems need to process thousands of messages a second, you need an environment that is efficient, tunable and ready for scaling. We all need well-designed search technology. A few days ago, a book called Scaling Apache Solr landed on my desk. The author, Hrishikesh Vijay Karambelkar, has written an extremely useful guide to one of the most popular open-source search platforms, Apache Solr. Solr is a full-text, standalone, Java search engine based on Lucene, another successful Apache project. For people working with Solr, like myself, this book should be on their Christmas shopping list. It's one of the best on this subject. Read below for the rest of sobczakt's review.
Scaling Apache Solr
author Hrishikesh Vijay Karambelkar
pages 215
publisher Packt
rating 9/10
reviewer sobczakt
ISBN 978-1783981748
summary Get an introduction to the basics of Apache Solr in a step-by-step manner with lots of examples
Karambelkar is an enterprise architect with a long history in both commercial products and open source technology. As he says, he currently spends most of his time solving problems for the software industry and developing the next generation of products.

The book is divided into 10 chapters. The first three are an introduction to Apache Solr and cover its architecture, features, configuration and setup. Chapter One contains many practical use cases for Apache Solr, to help beginners understand the topic.

Chapter Four is very interesting and describes common patterns for enterprise search solutions. These patterns focus on data processing/integration and on how to meet the requirements of users (interface, relevancy, general experience).

The rest of the book deals with its central topic: distributing search queries and scaling and optimizing a system. It discusses the main Apache Solr concepts, such as replication, fault tolerance and sharding, and illustrates them with helpful examples. It also gives a precise explanation of SolrCloud, a bundle of built-in distributed capabilities available since version 4.0.
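
As a rough illustration of the sharding idea (not taken from the book, and not how Solr itself is implemented), the sketch below routes documents to shards by hashing their IDs, then answers a query by asking every shard and merging the hits. The shard count, function names and sample documents are all hypothetical, and the MD5-based routing is a stand-in chosen only for simplicity.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 3  # hypothetical shard count, chosen for the example

# Hypothetical sample documents: id -> text
SAMPLE_DOCS = {
    "doc-a": "Solr scales because Solr shards",
    "doc-b": "Lucene is a library",
    "doc-c": "SolrCloud builds on Solr",
}


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard deterministically from a hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def build_shards(docs, num_shards=NUM_SHARDS):
    """Distribute documents across shards (each shard is a plain dict)."""
    shards = defaultdict(dict)
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search(shards, term):
    """Query every shard, then merge hits by descending term count."""
    term = term.lower()
    hits = []
    for shard in shards.values():
        for doc_id, text in shard.items():
            count = text.lower().split().count(term)
            if count:
                hits.append((count, doc_id))
    # Highest count first; ties broken by document id for stable output.
    return [doc_id for count, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


if __name__ == "__main__":
    shards = build_shards(SAMPLE_DOCS)
    print(search(shards, "solr"))  # ['doc-a', 'doc-c']
```

The point of the sketch is only the shape of the work: writes go to one shard, while reads fan out to all shards and are merged. Replication and fault tolerance, which the book also covers, are left out here.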

Chapter 8, dedicated to optimization, drew my attention. It is full of useful tips on JVM parameters, as well as on manipulating data structures and caching layers.

Scaling Apache Solr covers both basic and advanced subjects. The information is well organized, clear and concise. Many of the examples and cases in this book can be absorbed by beginners. I was pleasantly surprised by the chapter describing integration possibilities. There's some great information about using Solr with Cassandra, the MapReduce paradigm or R (a programming language for computational statistics), although I would have preferred this subject to be covered in more detail. The book has two more advantages: first, it discusses designing an enterprise search system in general terms, and second, it can be treated as an introduction to large-volume data processing.

I want to emphasize that many sections related to defining a schema, importing data, running SolrCloud or searching in near real time (NRT) are not just raw documentation; they also include the author's well-judged advice and comments.

Unfortunately, I felt some of the more advanced topics were not described in enough detail. Examples include index merging, document relevance and using dynamic fields in the data structure. Moreover, while reading the book I had a feeling that some parts do not fit the title, such as the section about clustering with Carrot2 or integration with a PHP web portal.

I can say that I read this book with pleasure and satisfaction, which is rare for technology publications. For me, as a person who has been working with Solr since version 1.3, it was a great way to review and sort out some of its aspects. On the other hand, I'm pretty sure that people starting out with Apache Solr will get a lot from this book. Although it is mainly focused on advanced problems, it starts with the basics.

Despite some minor imperfections, I recommend this book, especially because it describes a concrete technology in an easy-to-read way and also refers to some general architectural patterns.

You can purchase Scaling Apache Solr from amazon.com. Slashdot welcomes readers' book reviews (sci-fi included) -- to see your own review here, read the book review guidelines, then visit the submission page. If you'd like to see what books we have available from our review library please let us know.

  • by ArcadeMan ( 2766669 ) on Monday October 13, 2014 @01:36PM (#48133099)

    Hrishikesh Vijay Karambelkar, has written an extremely useful guide to one of the most popular open-source search platforms, Apache Solr.

    It's so popular that I never heard of it before today.

    • Re:Apache what? (Score:4, Informative)

      by nahpets77 ( 866127 ) on Monday October 13, 2014 @01:58PM (#48133323)
      I had never heard of it either until I needed to create an internal search engine where I work. After a few days of research, I found that Apache Solr/Lucene is often used for intranet search engines and for e-commerce sites.
      • I had never heard of it either until I needed to create an internal search engine where I work. After a few days of research, I found that Apache Solr/Lucene is often used for intranet search engines and for e-commerce sites.

        We use it to parse and index OCRd PDFs for full-text searching.
        My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.

        • What would you use as an alternative to Solr? One of the things I use it for is to index several internal wikis so that we can have a centralized search engine (also, the default search engines suck). In this case, I need to index the content as well. The thing that gave me the most difficulty was tweaking the config to get page rankings "just right".
        • by Shados ( 741919 )

          It really depends on what industry or subset of an industry you're in... I had to work on implementing something like that once for legal at an extremely large (and famous, or rather, infamous) company. Lawyers needed to run full searches against all our documents very very quickly to go through the bazillion lawsuit threats we were getting on a daily basis to figure out if they had some weight or not. That very much required full text search.

          • Yes, if you have to support it, you have to support it. I would steer clear of Solr based on my experience with it, however.
            We only added it on because the shit we use integrates well with it. It was a "why not?" that works well enough to not be ripped out, but I wouldn't do it again unless I had to.

        • by nbauman ( 624611 )

          My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.

          Lawyers use it. Magazines use it. Lots of people use it.

          • My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.

            Lawyers use it. Magazines use it. Lots of people use it.

            Lawyers use it because they have to - there is no alternative to search shit short of hiring monkeys to manually type up mountains of old documents. Often, those monkeys would have to be legally privileged to look at the documents, so it's not something you can shunt off to cheap labor / Mechanical Turk. OCR sucks. Solr sucks. Mixing the two is a big ol' suck fest.

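            Magazines use it because... they're stupid? There's no need to OCR a massive backlog of shit. For old shit that may not be digital, you can go ahead and hire a monkey to type it in. You're still left with Solr sucking, but on top of that much of a magazine's content is so heavily formatted/styled/image-based that a Solr index would not suit it well.

            If you NEED a fulltext index, there are plenty of alternatives, some mentioned by others in the comments on this article. I can only speak to OCR sucking, Solr's indexer sucking, and Solr's search giving me way too many things for it to be useful.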

            • by nbauman ( 624611 )

              My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.

              Lawyers use it. Magazines use it. Lots of people use it.

              Lawyers use it because they have to - there is no alternative to search shit short of hiring monkeys to manually type up mountains of old documents. Often, those monkeys would have to be legally privileged to look at the documents, so it's not something you can shunt off to cheap labor / Mechanical Turk. OCR sucks. Solr sucks. Mixing the two is a big ol' suck fest.

              Magazines use it because... they're stupid? There's no need to OCR a massive backlog of shit. For old shit that may not be digital, you can go ahead and hire a monkey to type it in. You're still left with Solr sucking, but on top of that much of a magazine's content is so heavily formatted/styled/image-based that a Solr index would not suit it well.

              If you NEED a fulltext index, there are plenty of alternatives, some mentioned by others in the comments on this article. I can only speak to OCR sucking, Solr's indexer sucking, and Solr's search giving me way too many things for it to be useful.

              What are some fulltext indexed open-source alternatives to Solr?

        • In that case, I must be you if you're the only one using it. :D Nope, the problem with full-text search is that it doesn't go deep enough. Especially in technical fields, it would be great to have search engines with a greater level of text comprehension than generic FTS (for example, what if the text uses a synonym instead of exactly the term you're looking for?)
    • Have you ever needed to implement site-wide search functionality? (Note: this is not the same thing as a global web search company like Google, Bing, etc.)

      If you have, and it involved anything more than turning on a checkbox in your platform, then you have almost undoubtedly encountered or considered Solr.

    • by Shados ( 741919 )

      There are a lot of popular things I've never heard about. The indexing/search space is actually pretty big, because it's one of those things everyone thinks is trivial, until you need to actually do it in meaningful ways or at scale. Almost everyone hits a big fat roadblock and starts looking for tools to do it (since it's more or less a solved problem). For the longest time, Lucene was a de facto standard, but it's fairly low level as far as indexing and searching goes, and everyone reinvented the wheel over it.

      So

    • by beholder ( 35083 )

      Hrishikesh Vijay Karambelkar, has written an extremely useful guide to one of the most popular open-source search platforms, Apache Solr.

      It's so popular that I never heard of it before today.

      That's the fun part about the IT world: even products as popular as Lucene/Solr, backed by companies with billion-dollar investments (e.g. Cloudera), may not be known by everybody.

      The good thing is that when you actually need to build a search for something, we have a solution for you. Built, tested and iterated on by the hundreds of full-time and hobby-time developers working on it while you are doing other - I am sure exciting - things in your own corner of the software universe.

      P.s. This is not a comm

  • by Anonymous Coward

    Doesn't one just use MongoDB and Solr automatically becomes web scale?

    • Doesn't one just use MongoDB and Solr automatically becomes web scale?

      Speaking of web scale, what's wrong with Google? Google custom search is really easy to set up (free or not free de-branded), and just works really well. The only use case I can see for something like this would be for stuff that can't go "web scale" because it's private. What else are people using it for?

  • by Anonymous Coward on Monday October 13, 2014 @01:51PM (#48133243)

    It's also based on Lucene, and has an easier setup and administration interface.

  • by Anonymous Coward

    Meanwhile all those actually using Solr/Lucene and who care about scaling have already moved to Elastic Search [elasticsearch.org] and don't need this book.

  • by bluefoxlucid ( 723572 ) on Monday October 13, 2014 @02:57PM (#48133845) Homepage Journal

    Searching and indexing information isn't a computer problem. We can already find information in massive databases--MongoDB and PostgreSQL handle that well.

    It's tagging information that's difficult. Contextual full-text searches often fail to find relevant context. Google does an okay job until you're looking for something specific. Searches for general information, like melting arctic ice sheets or the spread of Ebola, find something relevant; but try finding the particular documents covering the timeline Wikipedia gave for Thomas Duncan's infection, and each of the things the nurse said. You'll find all kinds of shit repeated in the media, but not how they originated. Some of the things in there are notoriously hard to find at all.

    I've thought about how to structure a Project Management Information System for searching and retrieving important data. Work performance information, lessons learned, projects related to a topic themselves. This steps beyond multi-criteria search to multi-dimensional search: I want to find all Lessons Learned about building bridges; I want to find all Programming projects which implemented MongoDB and pull all Work Performance Information and Lessons Learned about Schema Development; etc. I need to know about specific things, but only in specific contexts.

    For this to work well, people need to tag and describe the project properly. The Project Overview must carry ample wording for full-text search; but should also be tagged for explicit keywords, such that I can eschew full-text search and say "find these keywords". It would help if project managers marked projects as similar to other projects, and tagged those similarities (why is it similar?). A human can highlight what particular attributes are strongly relevant, rather than allowing the computer to notice what's related.

    With so much information, searching requires this human action to improve the results. It may also be enhanced by individualized human action: which humans produce which tags and relationships? Which humans do you feel provide useful tagging and relationships? What particular relationships do *you* find important? What relationships do you want to add yourself? This will allow an individual human to tailor the search to his own experiences and needs.

    On top of that, such things require memory: a human must remember certain things to know what to search for. I remember working on a project where... ...and so this becomes relevant to this search, and let me find similar things.

    Computer searching is a crude form of human memory: human memory is associative, and computer searching is keyword-driven. Humans need to use their own memories, to tell the computer how they see things, and then to tell the computer how they think about what they want to know--what it's related to, what it's similar to, who they think knows best about it--and have the computer use all that information to retrieve a data set. To do that, humans must manually remember in the computer and in their brains.

    The holy grail of searching is a strong AI that takes an abstract question, considers what you mean by its experience with you and its database of every other experience, pulls up everything relevant, decides what you would want to see, and discards the rest. Such a machine is largely doing your job: it's thinking for you, deciding what you'll remember, and making your decisions by occluding information which would affect your decisions. Anything less is a tool, and faulty, and requires your expertise to leverage properly.

    • I wrote a few stories about this. http://www.nasw.org/users/nbau... [nasw.org]

      The best search engine I've ever seen is PubMed: http://www.ncbi.nlm.nih.gov/pu... [nih.gov]. They structure information better than anybody else. But it requires a librarian to look at every document and code it according to a fairly elaborate coding scheme, the MeSH headings, which basically requires a degree in library science and a good medical education to do well.

    • Since you're down the tag route already, you might want to research how "faceting" works in search (think of the categories you see on some sites down the left-hand nav after a search... but generated dynamically).

    • "Computer searching is a crude form of human memory: human memory is associative, and computer searching is keyword-driven".

      Computer searching is completely different from human memory (to the extent that we really should use different words for them): for a start, human memory is associative, and computer searching is keyword-driven. More to the point, human memory is inextricably tied up with all our senses and the ways in which the brain remembers them, whereas computer searching consists of running algo

      • Keyword-driven searching is associative, but only in a minimal form. Humans remember things by remembering other things; they expect to find things in a computer by remembering something about the thing they want to find, and then entering it into the computer.

        In human memory, this brings up every association, categorized, detailed, and sorted by relative strength of association and frequency of use. On computers, we can track frequency of use automatically; strength of association is not automatic bec

  • What a surprise! A Slashdot Book Review with a 9/10 rating.
    https://www.google.com/?q=site... [google.com]
    You might want to normalize the ratings in your book reviews.
