Document Oriented Development

Two nights ago, I was editing the "So what? Who cares? Why would I ever want to use CouchDb?"* section on the home page of the CouchDb Wiki. As I was feebly trying to explain what CouchDb is good for, the words "document oriented application" popped into my head. I immediately liked it; it felt like I had a term to concisely describe the sorts of applications CouchDb is made for.

Today, I decided to Google the term "document oriented". Turns out it's not new: here's an article I found, "Towards truly document oriented Web services", on the O'Reilly site. The article gives an example of a REST API that is similar to the one I will be exposing with CouchDb. Cool.

"Document Oriented Development" I think this may be a poorly served yet hugely important area of application development. Particularly in storage and management. For document storage, you pretty much have two options in mainstream development, direct file system access and relational databases.

Traditional file-based systems are simple enough; this is how most PC applications have dealt with documents for a long time. MS Office is a prime example: all documents are files. But the lack of reporting capabilities and concurrency control limits what can be done, particularly in web applications.

And relational databases? There is nothing "relational" about documents, yet the vast majority of document management systems are built on top of an RDBMS. Unless you normalize to the 4th normal form, you'll need a fixed document schema, which limits flexibility. But when you do normalize to 4th normal form, performance suffers. Badly. Not to mention that the SQL queries become unwieldy.
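To make the fixed-schema complaint concrete, here is a minimal sketch in Python of what "no fixed document schema" could look like. It is only an illustration: the sample documents and the `find` helper are made up for this example and are not part of CouchDb or any other product.

```python
# Hypothetical documents: each one is a free-form dict with no shared schema.
documents = [
    {"type": "invoice", "number": 1001, "total": 250.0},
    {"type": "memo", "subject": "Office move", "body": "We move on Friday."},
    {"type": "invoice", "number": 1002, "total": 75.5, "approved_by": "Pat"},
]


def find(docs, **criteria):
    """Return the documents whose fields match every given key/value pair."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Query by a field that only some documents carry.
invoices = find(documents, type="invoice")

# A new field can appear on one document without touching the others.
documents[1]["reviewed"] = True
reviewed = find(documents, reviewed=True)
```

In a relational design, adding `approved_by` or `reviewed` would mean altering a table or adding another one; here the field simply exists on the documents that need it.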

XML databases are meant to solve these sorts of problems. There is even a standardized query language for them: XQuery. XML databases are great if you want to think of everything in terms of XML. But from what I've seen, XML databases will simplify development only if your data is already XML. Even then, I'm not so sure.

It seems ridiculous there aren't more mainstream tools to deal with this style of development. Lotus Notes got so much of this right over 15 years ago, and it's still singularly unique in its capabilities.

Define It?

I'd like to come up with a good definition of document oriented development, but the idea is still pretty nascent in my brain. This is what I wrote on the wiki to describe the applications:

A typical document oriented application in the real world, if it weren't computerized, would consist mostly of actual paper documents. These documents would need to get sent around, edited, searched, photocopied, approved, pinned to the wall, filed away, etc. They could be simple yellow sticky notes or 10000 page legal documents. Not all document-oriented applications have real world counterparts.

The Wikipedia has a good definition of document:

A document contains information. It often refers to an actual product of writing and is usually intended to communicate or store collections of data. Documents are often the focus and concern of Administration.

Documents could be seen to include any discrete representation of meaning, but usually it refers to something like a physical book, printed page(s) or a virtual document in electronic/digital format.

Hmmm... getting closer.

[Image: docorienteddev.jpg]
"Document Oriented Development" - By Ben Batchelder

Anyone want to take a crack at a definition of document oriented development? Or am I all wrong and there is nothing particularly special about being "document oriented"?

* (that section heading, along with a bunch of others, was added by Jeff Atwood of Coding Horror. Thanks Jeff).

Posted May 31, 2006 12:18 PM

Comments

Interesting. I mostly define document oriented on the technical level. Looking at it that way, Lotus Notes indeed does it right, as it allows you to define data elements on a document on the fly, instead of having to adhere to a strict schema. However, what happens if you want the capabilities of being document oriented, thus being flexible in your data definition, AND the relational capabilities?

We all know that Notes is not good for this; we really don't want to rely on a developer to keep related data consistent. The database should do this for us.

I myself have been wondering about the availability of a database system that combines the document oriented and relational requirements. While I haven't considered XML databases, I will definitely have a look at them.

What would your advice be if both requirements are needed?

Ferdy, May 31, 2006 2:33 PM

I think the term "document oriented development" was also used in association with OLE and OpenDoc. The aspect of both of them that most people seem to focus on was their GUI interaction, but they each had support for a structured file format. These file formats were intended to overcome at least some of the shortcomings of a traditional file system.

Laurence Gonsalves, May 31, 2006 2:37 PM

What would your advice be if both requirements are needed?

CouchDb of course! The storage model is not relational, but the computed table model with better query capabilities I think will help close that gap.
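As a rough picture of what a "computed table" over non-relational documents might mean, here is a minimal Python sketch. It is an assumption-laden illustration, not CouchDb's actual design or API: the `compute_table` function, its arguments, and the sample documents are all hypothetical.

```python
def compute_table(docs, columns):
    """Build sorted rows from the documents that have every requested column.

    Documents missing any column are simply left out of the table,
    so heterogeneous documents can share one store.
    """
    rows = []
    for doc in docs:
        if all(col in doc for col in columns):
            rows.append(tuple(doc[col] for col in columns))
    return sorted(rows)


# Hypothetical documents with differing fields.
docs = [
    {"author": "Bob", "title": "Invoice 7", "total": 12.5},
    {"subject": "no author here"},
    {"author": "Ann", "title": "Notes on views"},
]

# The table is derived from the documents; it is not how they are stored.
table = compute_table(docs, ["author", "title"])
```

The point of the sketch is only that the table is recomputed from whatever documents exist, so the storage model can stay schema-free while queries still see rows and columns.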

And no, I don't think XML databases in any way make this easier. Unless much of your data is already XML, that is.

Damien, May 31, 2006 3:19 PM

Just don't confuse it with Document Driven Development. :)

I know Microsoft has tooted its own horn with its document oriented APIs for MS Office applications. Interestingly enough, their APIs are very similar to the Lotus Script objects that came out around the same time as VBA.

Bob Balfe, May 31, 2006 10:24 PM

Funny. We have been developing a "document" oriented application for the last 10 years. It is mainly used as an Electronic Patient Record and has a user base of 30,000 in Scandinavia. See here:

http://www.ifi.uio.no/ecoop2004/docs/Siemens.pdf

We are now trying to get rid of that "document" tag we used initially because it is not too well perceived by our customers.

They cannot abstract the term beyond "real" paper documents. In our application, a "document" is a very broad concept (icons on the desktop, for example, are "documents" and "PageFolders" are documents built by aggregating other documents). In fact, when one thinks about it, our "document" oriented application is mainly an ORM (Object Relational Mapper) where "documents" define object structures, even if most of these objects happen to correspond to real paper documents.

Well, my point is: be careful using that "document" term if your concept is more generic than that.

David Brabant, June 1, 2006 6:09 AM

Good point David, I've struggled with what to call CouchDb. "Object Database" is probably more semantically correct, but "Object" is a terribly overloaded term in the industry. My fear is that it would get lumped in with all the OO databases and ORMs, which is a mistake because it's not only a different style of storage system, it's trying to solve a different set of problems.

I've settled on "document database", because I think that will give people the clearest notion of what it's good for, even if it limits their ideas about how it can be used. I'm trying not to fall into the trap of trying to be everything to everybody, which I see happen all too often in the software business. But then I may be limiting too much the other way.

Anyone have suggestions for alternate descriptions or metaphors for CouchDb?

Damien, June 1, 2006 2:35 PM

[delurking alert]

You're really talking about forms...

Much as we want to bring up a notion of documents, the fundamental unit that is addressable is a form. Documents are opaque, the marketing droids have overloaded the term even if it is applicable here. I would steer clear. Forms cut to the chase... Now I realize that forms are not sexy but that is what Notes addresses, what DabbleDB/Jotspot are trying to address and what it sounds like Couch addresses.

Here's an equation

Notes = forms + views + standard database format + infrastructure for efficient queries + possibilities for replication and offline

Then I would pitch your positioning as follows

Couch = Notes - standard database format + REST api (ergo. an Atompub server) + attention to evolving systems + nicer Damien Katz voodoo borne of painful experience

Now that the web is ubiquitous, there is less of the need for the note, in Notes terms. We have browser clients and such.

So I've been reading over your shoulder for about a year and meaning to comment. Indeed I've started thinking about data and am writing a winding expose and I think you'll figure in, but you've tempted me to surface...

Anyway some pointers that should inform your discussion:

Kragen on semi-structured data

I've recently been thinking about evolving systems and you likely saw the bit about coordination costs.

Alex Russell on the Jotspot model

I hope you've read Bosworth's web of data talk, looked at Google Data and the supporting BigTable stuff. In the same vein I'd look at Joe Gregorio's Atom Store dream and perhaps consider OpenSearch although it sounds like you have a formula language.

In closing, if document oriented works for you, run with it. I think you're doing forms, and since forms are glue, if you don't mind my coinage, this is what I'd pitch:

Couch = glue layer database for the web

Run with it

[back to lurking]

Koranteng Ofosu-Amaah, June 1, 2006 3:32 PM

Koranteng,

your perception of Lotus Notes, and hence your offered equations (and following conclusions) are flawed:

Notes is all about documents (and unstructured, composite data inside them) and not about forms. Forms are merely UI pieces and a way to represent parts or all of these documents, but a Notes database can very much exist with lots of document notes and literally no form notes at all (for example, when used programmatically for data storage and processing).

Next, for fellow readers to grasp what you mean when you say "standard database format", you really have to define that. What is a "standard database format" in your context? Most likely Notes' NSF database format doesn't match your definition of a 'standard database format'.

Also, I'm puzzled by your casual reference to the Atom publishing protocol when mentioning REST APIs. What's the point here? Just because Atom can use HTTP methods, how does this make Couch in your equation an "atompub server"? It doesn't.

Thomas Gumz, June 1, 2006 11:32 PM

I haven't got a customized definition, but the Document Oriented paradigm would seem to be an environment that has a set of heterogeneous objects arranged in a tree, with at least one default View of those objects rendered to the screen, and changes in the Objects are always immediately reflected in the View(s).

If there is only one View, it is one very close to the "final rendered product", but there can be more than one view. (e.g., "web page" vs "DOM viewer" view in Mozilla.)
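The description above (a tree of heterogeneous objects, with views that always reflect changes) can be sketched in a few lines of Python. This is only one possible reading of it: the `Node` class, the `outline` view, and all the names are hypothetical and chosen for illustration.

```python
class Node:
    """A tree node carrying arbitrary fields; views re-render on every change."""

    def __init__(self, name, **fields):
        self.name = name
        self.fields = dict(fields)
        self.children = []
        self.parent = None
        self._views = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        self._notify()
        return child

    def set(self, key, value):
        self.fields[key] = value
        self._notify()

    def attach_view(self, view):
        self._views.append(view)
        view(self)  # render once as soon as the view is attached

    def _notify(self):
        # Walk up the tree so views attached to any ancestor are refreshed.
        node = self
        while node is not None:
            for view in node._views:
                view(node)
            node = node.parent


def outline(node, depth=0):
    """One possible View: an indented outline of the subtree."""
    lines = ["  " * depth + f"{node.name} {node.fields}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


rendered = []  # every rendering the view has produced, oldest first
root = Node("page")
root.attach_view(lambda n: rendered.append(outline(n)))
title = root.add(Node("title", text="Hello"))
title.set("text", "Hello, world")
```

A second view, such as a raw field dump, could be attached to the same root without changing the nodes, which is roughly the "web page" versus "DOM viewer" distinction.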

I think it should be left deliberately broad, because this model handles a lot of real-world situations that aren't "documents", and sometimes the document view is only a convenient View, not "the final product". For instance there's nothing wrong with having a "node" that actually goes out and gets data; some web backend systems have been working like that for a while (without really realizing they're in this paradigm), and with XMLHttpRequest I've been consciously using this paradigm in my work, bundling up the widget that grabs and displays the data and the code that does the XMLHttpRequest into one programmatic unit.

Squeak seems to have elements of this model, and of course its brethren may too, it's just Squeak's the only one I've ever run.

The advantage of this paradigm is that it handles a lot of complicated, real-world problems without much extra work. The disadvantage is that it is so general that there aren't many cute tricks you can push off into the framework/paradigm; i.e., there's no obvious correspondence to anything like "joins" or the "ad-hoc queries" you get with the relational model. XQuery is probably as good as it gets and while I'll admit I haven't used it that much, it's always struck me as likely to be very difficult to understand what it will actually do to a set of a million heterogeneous documents once you get beyond the simplest queries. You end up needing to add your own "constraints" to the system to make it safely useful for data storage or reading data back out of the "document", and so far I've never seen anything that could help you with those constraints; many of them are somewhat complicated.

Jeremy Bowers, June 2, 2006 11:00 AM

Thomas,

I think you've misread me, but then I wasn't being precise in my handwaving... I don't want to detract from Damien's search for metaphors but perhaps I should clarify...

The note and the Notes NSF format are what I consider to be a 'standard database format'. That's the genius of the architecture that makes things like replication and offline runtimes work. The notion of a self-contained document packaging data and design elements allows for lots of flexibility. InfoPath is belatedly aiming in this direction.

I only discount the importance of the NSF these days because I believe in the browser as the ubiquitous client and I suspect that most Couch applications will be web applications. Perhaps Damien can weigh in on whether he's planning on emphasizing his equivalent of the note.

Still, language is important and, to reiterate, the word "document" is so tainted and overloaded with meaning that leading with it may confuse more than enlighten. Your mileage may vary of course.

The reason I suggest forms is that for most people that's the entry point into the kind of applications that are built in this area. You can read more handwaving about my characterization of Notes if you want: "forms and views and the client and server processes that can manage them."

The Alex Russell quote about Jotspot that I was getting at was:

"Instead of forcing you to think about some kind of MVC fuss-and-bother, you build what you were after in the first place, usually a form, and then start iterating on the implied structure of that data. You don’t change a model and upgrade a schema, you just add the property you wanted to add."

That's the bit that Notes is good at and evolving systems are what it targets. It should come as no surprise that the first usage examples that Damien has put on the wiki are all forms-based applications, which meshes well with the "usually a form" emphasis. When developing K-station, we used Notes for data storage and as exactly the kind of document-oriented database that you're suggesting is possible. We barely used the form capabilities and we got scalability, replication and a distributed environment that WebSphere Portal (built on a traditional relational database) is still trying to match 6 years on. But we were a minority; most uses of Notes that I have seen have been for form-based applications. Maybe you have seen different usage patterns.

How one deals with unstructured and semi-structured data and how one handles the evolution of an application is the most interesting puzzle for me. I'm curious to see Damien's take once his stuff is mature. We don't normally think about these things up front but that is where the bulk of the costs and user frustration are.

Lastly, I fully understand that REST apis don't make an Atompub server. I was, however, thinking about positioning, and in positioning a young technology, leverage is everything. There is absolutely no reason not to be an Atompub server and leverage that ecosystem. I suspect that any REST apis Damien adds would turn out to be very close to Atompub. So perhaps you can treat that reference as a casual hint or advocacy if you will...

Back to metaphors: I still think it's all about the glue layer.

Koranteng Ofosu-Amaah, June 2, 2006 1:05 PM

I only discount the importance of the NSF these days because I believe in the browser as the ubiquitous client and I suspect that most Couch applications will be web applications. Perhaps Damien can weigh in on whether he's planning on emphasizing his equivalent of the note.

I'm really not sure what you are getting at here, so I'll just state that CouchDb is designed first with web applications in mind. And I also intend that users can take those applications offline and resync changes when later connected. Much like Notes.

I agree that so often, when people think about their custom development needs, they immediately think of forms. This is completely natural as most documents in a business, by volume, are form based (W2, medical forms, invoices, bills, vouchers, etc.). So, when they look to computerize things, so often the real world counterpart is form based.

However, I think the metaphor of forms doesn't work because so many things like emails and blog posts aren't considered forms; they are far more commonly called documents (and even more commonly called "emails" and "posts"). In other words, "form" I think already has too specific a connotation on the web that would only serve to confuse the already savvy. Which is bad. But I do like it for its relevance to real world business documents.

As far as Atom apis and the like, I haven't investigated it enough to know how well it fits into the whole Couch vision (of which I've not written much). I am soon enough going to start with the REST api work, so I'll be investigating it then.

However, even if I don't implement an Atom api now, it doesn't mean it can't be done at a future date. Right now I'm only using "standards" if it simplifies things. There'll be plenty of time to complicate it later ;)

Damien, June 2, 2006 3:35 PM

Interesting - document-oriented is exactly how I have described both Notes and Workplace Designer based programming, primarily to differentiate it from what you would do with other systems and other tools, like Eclipse or Rational tools. A lot of times, developers don't understand this document thing - it's not transactional, they complain - or, it's not relational! - what do you mean, the schema can change on the fly??!? etc, etc.

So, I understand the struggle to describe why you would want such a system and why it's different and for goodness sakes, what's a document? Maybe I'll post more on this later on my own blog.

Until then,

Chris

Chris Reckling, June 13, 2006 7:05 PM