What would you use as an alternative to Solr? One of the things I use it for is to index several internal wikis so that we can have a centralized search engine (also, the default search engines suck). In this case, I need to index the content as well. The thing that gave me the most difficulty was tweaking the config to get page rankings "just right".
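For anyone wondering what "tweaking the rankings" boils down to, here is a toy sketch in plain Python. It is not Solr's scoring model or config syntax; the field names, the weights, and the `score`/`rank` helpers are all made up for illustration. The idea is only that a match in the title can be made to count for more than a match in the body.

```python
from collections import Counter

# Hypothetical per-field weights (assumption, not a Solr setting).
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def score(doc, query, weights=FIELD_WEIGHTS):
    """Sum the weighted counts of the query words across the document's fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in weights.items():
        counts = Counter(doc.get(field, "").lower().split())
        total += weight * sum(counts[t] for t in terms)
    return total

def rank(docs, query, weights=FIELD_WEIGHTS):
    """Return the docs that match at all, highest score first."""
    scored = [(score(d, query, weights), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for s, d in scored if s > 0]
```

With equal weights, a page that merely repeats the word a lot in its body can outrank the page that is actually about it. Raising the title weight is one crude way to fix that, and finding the value that feels right is the fiddly part.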
In that case, I must be you if you're the only one using it. :D Nope, the problem with full-text search is that it doesn't go deep enough. Especially in technical fields, it would be great to have search engines with a greater level of text comprehension than generic FTS (for example, what if the text uses a synonym instead of exactly the term you're looking for?).
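To make the synonym point concrete, here is a minimal sketch of query-time synonym expansion over a toy inverted index. Everything in it is an assumption for illustration: the `SYNONYMS` table is hand-written, and `build_index` and `search` are hypothetical helpers, not the API of any real search engine.

```python
from collections import defaultdict

# Hypothetical synonym table; a real one would be curated for the domain.
SYNONYMS = {"car": {"automobile", "vehicle"}, "automobile": {"car"}}

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, term, synonyms=SYNONYMS):
    """Return ids of docs containing the term or any of its listed synonyms."""
    term = term.lower()
    hits = set()
    for word in {term} | synonyms.get(term, set()):
        hits |= index.get(word, set())
    return hits
```

A search for "car" with an empty synonym table misses a document that only says "automobile"; with the table it is found. This is still plain term matching, so it only goes as deep as the table does, which is roughly the complaint above.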
Computers can figure out all kinds of problems, except the things in the world that just don't add up.
Apache what? (Score:3)
It's so popular that I never heard of it before today.
Re:Apache what? (Score:4, Informative)
Re: (Score:3)
I had never heard of it either until I needed to create an internal search engine where I work. After a few days of research, I found that Apache Solr/Lucene is often used for intranet search engines and for e-commerce sites.
We use it to parse and index OCRd PDFs for full-text searching.
My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.
Re: (Score:2)
Re: (Score:0)
Re: (Score:3)
It really depends on what industry or subset of an industry you're in... I had to work on implementing something like that once for legal at an extremely large (and famous, or rather, infamous) company. Lawyers needed to run full searches against all our documents very, very quickly to go through the bazillion lawsuit threats we were getting on a daily basis to figure out if they had some weight or not. That very much required full-text search.
Re: (Score:2)
Yes, if you have to support it, you have to support it. I would steer clear of Solr based on my experience with it, however.
We only added it on because the shit we use integrates well with it. It was a "why not?" that works well enough to not be ripped out, but I wouldn't do it again unless I had to.
Re: (Score:2)
My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.
Lawyers use it. Magazines use it. Lots of people use it.
Re: (Score:2)
My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.
Lawyers use it. Magazines use it. Lots of people use it.
Lawyers use it because they have to - there is no alternative to search shit short of hiring monkeys to manually type up mountains of old documents. Often, those monkeys would have to be legally privileged to look at the documents, so it's not something you can shunt off to cheap labor / Mechanical Turk. OCR sucks. Solr sucks. Mixing the two is a big ol' suck fest.
Magazines use it because... they're stupid? There's no need to OCR a massive backlog of shit. For old shit that may not be digital, you can
Re: (Score:2)
My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.
Lawyers use it. Magazines use it. Lots of people use it.
Lawyers use it because they have to - there is no alternative to search shit short of hiring monkeys to manually type up mountains of old documents. Often, those monkeys would have to be legally privileged to look at the documents, so it's not something you can shunt off to cheap labor / Mechanical Turk. OCR sucks. Solr sucks. Mixing the two is a big ol' suck fest.
Magazines use it because... they're stupid? There's no need to OCR a massive backlog of shit. For old shit that may not be digital, you can go ahead and hire a monkey to type it in. You're still left with Solr sucking, but on top of that much of a magazine's content is so heavily formatted/styled/image-based that a Solr index would not suit it well.
If you NEED a fulltext index, there are plenty of alternatives, some mentioned by others in the comments on this article. I can only speak to OCR sucking, Solr's indexer sucking, and Solr's search giving me way too many things for it to be useful.
What are some fulltext indexed open-source alternatives to Solr?
Re: (Score:2)