
Saturday, 3 August 2013

MongoDB and the Single View of the Customer

There was quite a lot of publicity generated a few weeks ago by 10Gen, the commercial organisation behind MongoDB, about how one of their major financial services customers (MetLife) was using their product to create an application that provides a "360 degree view of the client". For nearly three years I have been leading a programme that created data services used to satisfy similar use cases, and I wanted to share my thoughts on the 10Gen approach. Note that I have absolutely no knowledge of the MetLife solution beyond what I have read in the 10Gen press releases and in the more technical webinar.

I should state at the outset that I think MongoDB is potentially a very interesting technology for building a data service (which is what I believe this problem is really about), for reasons I will go into later in this post. However, I do think there are several challenges and considerations, not really discussed in the webinar, that anyone doing this type of thing needs to think hard about up front.

In case you haven't watched the webinar, the key reasons cited in the presentation to support MongoDB being a good choice are:
  1. The lack of a strongly enforced schema makes development more agile, so delivery is faster and cheaper
  2. Performance compared with relational databases is much higher (reduced latency as well as being inherently more scalable, with built-in automatic sharding)
  3. Rich querying (compared with NoSQL solutions that provide only a basic key/value interface) - see the sketch after this list
  4. An aggregation framework allows roll-ups to be done in the database (again, compared with other NoSQL solutions that may push more work to the client)
  5. Map/Reduce, either natively or connected to Hadoop, to support offline analysis of the data
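
To make points 3 and 4 a little more concrete, here is a minimal sketch of the kind of server-side querying MongoDB offers, written with pymongo. The "clients" collection and its fields are purely hypothetical and are not taken from the MetLife solution.

    # A hypothetical query on nested fields with a projection - something a plain
    # key/value store cannot express server-side.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["customer360"]

    cursor = db.clients.find(
        {"addresses.country": "US", "policies.status": "ACTIVE"},
        {"name": 1, "policies.policyNumber": 1, "_id": 0},
    ).sort("name", 1).limit(20)

    for doc in cursor:
        print(doc)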

Managing Expectations

The webinar states that the first release was to support the application used by the contact centre staff, but that the intention is for other applications to use the data service too (including client-facing web apps). This is a key point - I am sure that the business sponsors will now expect IT to be able to hook up different applications with minimal cost and time, and that existing applications will continue to work.

With any data service, the first release is by far the simplest. You can change anything without worrying about the impact on existing applications. However, it is a very different story as soon as you have multiple applications consuming the data. They will all likely have their own release schedules and budgets, so they won't want to do releases at the same time. Unless you are working in an organisation that is extremely agile, this is going to be a big issue if you don't plan for it.
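
One possible way to plan for this, sketched below with hypothetical field names and version numbers, is to tag each document with a schema version and keep changes additive, so that consumers on older release cycles keep working while newer ones adopt the new shape.

    # Hypothetical example: adapt documents written under an older schema version
    # to the latest shape at read time, so writers and readers on different
    # release cycles can coexist.
    def read_client(doc):
        version = doc.get("schemaVersion", 1)
        if version == 1:
            # v1 stored a single flat address string; v2 uses a list of address objects.
            doc["addresses"] = [{"full": doc.pop("address", "")}]
            doc["schemaVersion"] = 2
        return doc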

Build a Solid Foundation

For the reason noted above, I would always recommend building your first release for more than one application - a single application may skew the use cases, and the data model gets better scrutiny when it has to serve several. Once you have applications in production, making any non-evolutionary (i.e. breaking) changes to the data model becomes more and more expensive as the number of consumers grows.

The webinar only briefly mentions the technology that sits on top of the database (I read into it that it was an XML service, but in any case there will be some kind of typed model). Whatever the format of the data, it is critical that there is a very detailed data dictionary and that the schema is documented with samples. As in most complex domains, there are many different but equally valid ways of modelling common concepts in financial services, so good samples need to be provided.

This makes the first claimed benefit of Mongo largely moot - the lack of an enforced schema is really not an advantage when writing an enterprise data service, because change in such an environment must be planned and controlled among different groups. (Writing a data tier for a single application would be an entirely different story.) You do, however, want tooling that can help manage the change and evolution of the schema (something I will discuss in future posts).
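
If the database will not enforce the schema, the service layer can. Below is a minimal sketch of one possible approach, validating documents against the published data dictionary with the jsonschema package; the schema fragment shown is hypothetical, not the MetLife model.

    # Enforce the documented schema at the service boundary before anything is persisted.
    from jsonschema import validate  # raises jsonschema.ValidationError on bad input

    CLIENT_SCHEMA = {
        "type": "object",
        "required": ["clientId", "name"],
        "properties": {
            "clientId": {"type": "string"},
            "name": {"type": "string"},
            "policies": {"type": "array", "items": {"type": "object"}},
        },
    }

    def validated(doc):
        validate(instance=doc, schema=CLIENT_SCHEMA)
        return doc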

Canonical Model

The other point in the webinar that is, in my view, short-sighted is that they appear to conform very little of the data (at least that is the impression given). Effectively they serve data in the format in which it was staged from the source systems. This is not a unique approach - I have seen several applications built using data from the staging area of a data warehouse (mostly tactical ones, I should add). There are a few obvious problems with this approach:

  1. It forces the burden of interpreting the different source system structures and conventions onto the consumers of the data. At that point your consumers will likely question what value the enterprise data service is adding.
  2. If a major data source is upgraded and its data model changes, every consumer has to deal with that change.
  3. Unless you conform the data, it becomes harder to aggregate it without introducing errors, or display it unambiguously to clients. Some simple examples: are the prices for bonds clean or dirty? Are taxes and fees split out or consolidated? Trade or settlement date accounting? How are backdated corrections handled?
The last point is particularly important if you intend to display data to clients directly on a website. But it is also very relevant to anyone building any kind of MI platform, especially if the users do not understand (or do not want to have to understand) the intricacies of each data source.
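
To illustrate what conforming means in practice, here is a small sketch that maps two hypothetical source systems onto a single canonical position record; the source names, fields and the clean/dirty price handling are invented for the example.

    # Hypothetical adapters that conform source-specific conventions into one canonical
    # shape, so consumers never need to know which system a record came from.
    def conform_position(source, raw):
        if source == "legacy_bonds":
            # This source quotes dirty prices; the canonical model uses clean prices.
            price = raw["price"] - raw["accrued_interest"]
            fees = raw["fees"]
        elif source == "new_trading":
            price = raw["clean_price"]
            fees = raw["commission"] + raw["exchange_fee"]  # fees consolidated in the canonical model
        else:
            raise ValueError("unknown source: " + source)
        return {"instrumentId": raw["isin"], "cleanPrice": price, "fees": fees, "source": source}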

Performance

The claims about performance and scalability are made without mentioning competitor products, so it is hard to comment on them. While the established relational databases generally do not have a scale-out model, or have only recently introduced an in-memory model (e.g. SQL Server 2014 CTP, aka Hekaton), I would be interested in comparing the performance with an in-memory, scale-out relational database such as the outstanding VoltDB.

While the querying cited as an advantage is certainly rich compared with other NoSQL solutions I have used, the example in the webinar is extremely simple and does not really attempt to show how it compares with a relational solution. Claims about performance would be far more credible if they used a complex example involving filtering, grouping and aggregation.
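
For what it is worth, the kind of example I have in mind would look something like the sketch below: filter, group and aggregate in a single server-side pipeline. The collection and field names are hypothetical.

    # A hypothetical aggregation pipeline combining filtering, grouping and aggregation.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["customer360"]

    pipeline = [
        {"$unwind": "$policies"},
        {"$match": {"addresses.country": "US", "policies.status": "ACTIVE"}},
        {"$group": {
            "_id": "$policies.productType",
            "totalPremium": {"$sum": "$policies.annualPremium"},
            "policyCount": {"$sum": 1},
        }},
        {"$sort": {"totalPremium": -1}},
    ]
    results = list(db.clients.aggregate(pipeline))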

Simplicity

The most obvious advantage of the MongoDB solution, in my view, would be the vastly reduced complexity of constructing an XML representation of the data. This is clearly something that hierarchical databases are well suited to. (Although hierarchical databases are not new - MUMPS, anyone? - and they come with their own set of limitations.) The webinar does not really highlight this aspect, as I recall.
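
As a rough illustration of why the mapping is so direct, the sketch below turns an already-nested (hypothetical) client document straight into XML, with no joins or reassembly required.

    # Recursively convert a nested document into XML; the document shape is hypothetical.
    import xml.etree.ElementTree as ET

    def doc_to_xml(tag, value):
        element = ET.Element(tag)
        if isinstance(value, dict):
            for key, child in value.items():
                element.append(doc_to_xml(key, child))
        elif isinstance(value, list):
            for item in value:
                element.append(doc_to_xml("item", item))
        else:
            element.text = str(value)
        return element

    client_doc = {"name": "A Client", "policies": [{"policyNumber": "P-1", "status": "ACTIVE"}]}
    print(ET.tostring(doc_to_xml("client", client_doc), encoding="unicode"))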

Conclusion

I think there is a great deal of mileage in exploring how a hierarchical (or document) database could be used in an enterprise data service, but I don't think the 10Gen webinar covers a lot of the areas that people with the battle scars of implementing such a service would identify with. In particular, unless you are taking an extremely short-term view, the schema flexibility is not a major bonus for this kind of application.

There are several other areas that anyone embarking on this kind of project needs to think about carefully to ensure long-term success and a lower cost of ownership, and I will blog about these in future: for example, schema and service evolution; test automation of data acquisition (ETL) processes (most of the well-known tools offer very little to support this essential part of modern software development); and supporting 24x7 operation, including system upgrades in a batch environment.
