Do you ever feel like change sneaks up on you? Or feel surprised at how much progress you’ve made over the course of time, when you don’t notice it day by day?
I just got that feeling when showing someone how we use search on structured data, instead of using a database. There’s a set of structured data problems that can be handled very nicely with today’s search technology. Over the last decade, search has steadily become a better solution for these problems. I’m deeply engaged with search technology and work with it every day, yet I’m bowled over when I take a step back and see how far it’s come.
Most people still automatically turn to the familiar database and are surprised by the thought of using search. If that includes you, then read this blog post to understand where search fits today and the kind of problems you can solve – in a different way than you are used to. For many applications traditionally handled with databases, it’s remarkable what you can do with search engines.
We all know how quickly the volume and variety of data is growing, while at the same time more and more business processes and decisions are data-driven. It’s very common to be in a situation where data is spread out across different systems, and where fitting together the puzzle pieces across this data is important. The problem of combining heterogeneous data sources, often referred to as information silos, is called data integration.
The data integration problem comes up over and over. “Know your customer” initiatives depend on combining data from different sources; law firms must aggregate data from multiple systems to get a central view of a client or matter; combining research results from different bio-informatics repositories is essential in life sciences research; logistics and operations centers combine data from suppliers and consumers together; corporate mergers require integrating data from multiple different systems; the list goes on and on.
Approaches to this problem have evolved over decades, with a huge amount of innovation along the way: SQL-based data warehouses and data marts, data virtualization, data lakes and data hubs, and now search-based data integration.
Each of these approaches is valid and can be effective; the best fit depends on the application. Traditional SQL-based data warehouses and data marts are by far the most prevalent. However, I suspect many folks are using SQL databases by reflex, rather than considering newer approaches, which could be a better fit.
For the rest of this blog post, I’ll compare traditional SQL databases and search-based data integration. I won’t dive into data virtualization (usually applied in distributed transaction systems), or data lakes and data hubs (usually applied in data science initiatives) here. For the kind of applications where people use data marts, search-based data integration is the main alternative. To understand where search indexes can be better than relational databases in this domain, let’s consider a few applications.
Keeping systems running smoothly is important to all organizations, and vital to many. And there is a big market just for software tools to aid this: network management, application management, security management, and the like. Back when I was CTO of Empirix (about 16 years ago) we were one of the vendors in that market, and every single system included a SQL-based data mart with current and historical data. That’s what we knew, and that’s what worked best at the time.
Splunk, a company founded in 2003, changed that. They built a search-based repository as the heart of their product, from which they generate graphs, reports, alerts, dashboards, and visualizations. This disrupted the whole industry and rocketed Splunk to an IPO. More recently, Elasticsearch has made enormous inroads into this market. They have dozens of OEM partners building IT management tools on Elasticsearch, and a major fraction of their 100M+ downloads are for this kind of application. These days a search-based approach is definitely the way to go for log analytics.
One outgrowth of these characteristics is the advent of new visualization tools based on search, including Kibana and Grafana. These take advantage of fast query performance and the ability to do ad hoc queries to provide a BI-style dashboard experience that is incredibly flexible and easy for users to tweak.
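To make that concrete, here is a minimal sketch (not production code) of the shape of aggregation request such a dashboard panel sends to a search engine at query time - bucket events by time, then break each bucket down by the top values of a field. The field names `@timestamp` and `status_code` are invented for illustration:

```python
import json

def dashboard_panel_query(metric_field, interval="1h", top_n=5):
    """Build the kind of aggregation request a dashboard panel sends:
    bucket events into time intervals, then break each bucket down
    by the top values of a field - all computed at query time,
    with no pre-built reporting cube."""
    return {
        "size": 0,  # we only want aggregation buckets, not individual hits
        "aggs": {
            "over_time": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval},
                "aggs": {
                    "top_values": {
                        "terms": {"field": metric_field, "size": top_n}
                    }
                },
            }
        },
    }

print(json.dumps(dashboard_panel_query("status_code"), indent=2))
```

Because the buckets are computed at query time rather than pre-aggregated, changing the interval or the breakdown field is just a tweak to the request - which is exactly why these dashboards feel so flexible to end users.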
At BA Insight, when we decided to build our analytics product, Smart Analytics, we had a choice to make: use a traditional SQL-based approach (including pre-processed cubes for reporting and a BI tool such as PowerBI), or use a search-based approach. We chose to build on Elasticsearch and Kibana, and never looked back.
Business Intelligence has a huge range, from simple spreadsheets through enterprise-scale production reporting. Operational dashboards, showing important metrics that may be tailored to specific individuals, are a common pattern. These are predominantly done through SQL-based databases and data visualization libraries or dashboarding tools. But increasingly this kind of dashboard is search-driven. Ten years ago this involved proprietary dashboarding tools adapted to search (at FAST Search we had a thing called FAST Radar); five years ago some major BI vendors like Tibco Spotfire provided search integrations; today there are great open source tools and libraries.
At BA Insight, we’ve worked with a number of very large legal clients that were looking at two distinct projects: Enterprise Search and Personalized Data Dashboards. We were able to deliver both with a single search engine, cutting implementation time way down and providing extreme flexibility for the future. Now, when they want to add new data, it's easy, and the data becomes available in both the dashboards and the search engine at the same time. This approach let our clients do data integration without heavy lifting in SQL queries, and the flexibility of the search engine schema means better tolerance of incomplete and dirty data. (I’ve learned that in the real world, there is no such thing as perfectly clean data - all data is dirty.) The same factors that let the search-based approach disrupt the IT management space apply to these dashboards: faster performance, a more flexible schema, and a natural, well-understood UI. What is remarkable to me is how quickly lawyers learn to use ad hoc queries to dig into their dashboards, and how happy they are with this ability.
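To see why a flexible schema tolerates dirty data, here is a minimal Python sketch of the idea: records from different systems get flattened into search documents, keeping whatever fields happen to exist rather than failing on a rigid table definition. The field names and source systems below are invented for illustration:

```python
def to_document(record, source):
    """Flatten a record from any source system into one search document.
    Unlike a rigid star schema, missing or extra fields are fine:
    we index whatever is present and tag where it came from."""
    doc = {k: v for k, v in record.items() if v not in (None, "")}
    doc["_source_system"] = source
    return doc

# Two systems describe the same client with different, partly dirty fields.
crm_record = {"client": "Acme LLP", "partner": "J. Fried", "billing_code": None}
dms_record = {"client": "Acme LLP", "matter": "M-1042", "practice_area": "IP"}

docs = [to_document(crm_record, "crm"), to_document(dms_record, "dms")]
# Both land in the same index; a dashboard can facet on any field that exists,
# and documents missing a field simply fall out of that facet.
```

In a SQL data mart, the null `billing_code` and the mismatched columns would force schema and ETL work up front; in a search index, the documents just coexist.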
Logistics management is a big field, ranging from supply chain management through production and distribution. Nearly everywhere you look there are data integration problems – between different companies as well as different systems at the same company. Up-to-date, accurate information must be available quickly.
In e-Commerce (a part of logistics management, or a kissing cousin, depending on your viewpoint), search engines took over from databases for online shopping sites in the late 1990s. The ability for search to handle very large query loads and the natural UI made this a no-brainer. Now the same is starting to happen in other areas of logistics.
The dramatic experience of XPO Logistics, a major supply chain solutions provider, is a great example. They replaced the SQL datastore behind a key truck dispatching application with Elasticsearch. With the MySQL-based repository, they were suffering "normal" response times in the 30 second range, and sometimes much longer. With Elasticsearch, they immediately saw consistent 150ms response times - a 200x speedup! In addition, they can handle more traffic and provide a more natural UI for dispatchers and shipping agents. They’ve gone on to do much more with the search technology (you can watch a session they gave at Elasticon here).
Network Analysis isn’t a market per se; it’s a set of applications of network science that deal with complex networks such as telecommunications networks, computer networks, and social networks. Social network analysis is used heavily in customer service, marketing, intelligence, and law enforcement. These all involve mapping and measuring relationships and flows between people, groups, organizations, and different kinds of connected information and knowledge entities. Typically, the software that runs these includes some very complex, highly mathematical algorithms, used in conjunction with data from many sources integrated using – you guessed it – a conventional SQL-based data mart.
Search technology is making serious inroads in this area as well. Palantir, for example, has done very well in critical intelligence applications.
All of the advantages of search technology mentioned above apply to this domain too, plus another one: fuzzy matching - the ability to match things that are close but not identical.
For social networks, this kind of fuzzy matching is quite important. People’s names, for example, vary in different settings (I am Jeff Fried in some places, Jeffrey A. Fried in others, and my name is often misspelled as Freed). The connection of people to each other or people to subject areas is not binary – there are different strengths to those connections as well as other nuances.
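The machinery behind fuzzy matching is typically edit distance - Elasticsearch's `fuzziness` parameter, for example, is defined in terms of it. Here is a minimal sketch of the classic Levenshtein computation, using the "Fried" vs. "Freed" example above:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: the number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete from a
                           cur[j - 1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

# "Fried" and the common misspelling "Freed" are one edit apart,
# so a fuzzy query allowing one edit would still match.
print(edit_distance("Fried", "Freed"))  # → 1
```

A search engine applies this kind of tolerance at query time, so near-misses in names surface naturally instead of requiring exact-match joins to be cleaned up ahead of time.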
BA Insight took advantage of this when we built our Expertise Locator product. To identify experts, it’s important to use their whole digital footprint, which can include documents, user profiles, tags, work and billing records, project histories, bios, blogs, posted discussions, emails, and more. There are different levels of expertise and it is important to provide users with a natural user experience to explore available expertise and trade off different criteria. It was natural to build this on top of search.
I would never claim that a single technology covers everything - there are no silver bullets. There are things that databases do very well and search engines do badly, transaction processing being the prime example. Databases are built to handle online transaction processing (OLTP), which is the underpinning of everything from banking systems to inventory management. You would never use search technology for these kinds of transactions.
There are also some areas that historically favored SQL-based data marts over search engines, but where the gap has recently closed or is closing. Hardware efficiency is a good example. The density of both relational databases and search engines has reached the point where many applications run on the minimum hardware footprint (usually dictated by having at least two servers for fault tolerance). As you scale up, there is still a difference, but often only on the order of 30%. The cost of hardware has dropped enough that this is rarely material, and the labor cost associated with that hardware is small - in fact, the availability of cloud-based solutions makes these costs almost nil, as well as reducing the direct cost of infrastructure.
Search indexes are still going to be larger than database storage (depending on the data mix), but the small difference in hardware nets you a huge payback in query performance and flexibility.
There is no doubt that search technology is now fully capable of running applications traditionally handled by SQL-based data marts. This approach is now mainstream. There are some significant advantages to be gained in the process, including dramatically improved performance and remarkable flexibility.
If SQL databases are what you are used to, then they will feel more comfortable than using a search engine (or any other type of noSQL DB), at least at first. This is human nature. But you will discover that there is a fast-growing community around search, and a good body of training material available. With Elasticsearch downloads now numbering over 100 million, you are in good company. You may also have the option of simply licensing a product which is built using search instead of a SQL database (for example, BA Insight’s Smart Analytics, Expertise Locator, or Matter Comparison applications).
As I was writing this blog post, I went back and looked at an article I wrote nearly 10 years ago about search on structured data. We were doing exciting things back then, but it’s nothing compared to where search-based data integration is today. This is not a quick fad; it’s been a decade in the making and has huge momentum. I definitely feel like the magnitude of this change has snuck up on me. It’s kind of the reverse of the well-known quote about estimating change in the future. The last line, though, applies whether you are looking forward into the future or back into the past: don’t let yourself be lulled into inaction. The next time you are considering a new data mart, take a look at using a search engine instead.