Introduction Clinical databases may contain a large variety of data from different domains, eg, patient visits, test results, laboratory reports, diagnoses, therapy, medication, and procedures. Clinical databases may have different purposes, eg, patient management, electronic patient records, clinical research, and quality control. Clinical databases usually have a large number of users with different requirements for views of the database. The administrator does not want to view data per patient, while the nurse must be able to lookup current medication for a specific patient. The researcher may want to do data mining on clinical information for thousands or millions of patients, and the clinician should be able to see his or her ambulatory schedule. Most clinical databases comprise only a part of these functionalities, but these examples illustrate the challenge that designers of clinical databases face. Furthermore, in contrast to schemas from many other domains (eg, finance and public administration) the logical data schemas of clinical data are always incomplete and developing. In databases, an entity is a single person, place, or thing (eg, patient or diagnostic test) about which data can be stored. In conventional relational database design, each entity is mapped to one or more tables using values of one or more rows to uniquely identify each record. That means that for each entity there exists at least one table. This strategy works well for most databases even if the number of concepts involved in a domain may be high. As long as the domain of interest remains relatively unchanged, the table layout (ie, the physical schema) should work well for many years. The domain of clinical science in particular (and biology in general) is, however, under constant development as new concepts appear and old concepts are modified or deferred. In a conventional database (that is, in a conventional relational database), new tables must be created to record new concepts. To give users access to the new tables, new forms must be designed and links to these forms must be provided in the user interface. If a table that is already in the database needs to be modified care must be taken not to destroy existing data and not to break any constraints. Accordingly, user-interface forms must be redesigned to reflect changes (eg, fields that have been added or removed) in existing tables. The complexity and the rapid evolution and expansion of the domain of clinical information thus require a large maintenance overhead if data are laid out using a conventional design. For this reason, it is desirable that a clinical database be flexible and allow for modifications and for addition of new types of data without having to change the physical database schema. The ideal clinical database would therefore implement a highly-detailed logical database schema in a completely-generic physical schema that stores the wide variety of clinical data in a small (and constant) number of tables. The aim of this project was to provide an overview of techniques and problems in generic design of Web-based clinical databases. Methods 1 2 Results The final PubMed search was done on July 11, 2003 using the search term: (generic database design clinical) OR (entity attribute value). This term was translated by PubMed into: (((entity[All Fields] AND attribute[All Fields]) AND value[All Fields]) OR (((generic[All Fields] AND ("databases"[MeSH Terms] OR database[Text Word])) AND design[All Fields]) AND clinical[All Fields])). 3 9 Google was searched on the same day using the search term: clinical database generic design. 10 11 The 9 resources were all from either of 2 research groups: Department of Medical Informatics, Columbia University, New York, NY and Center for Medical Informatics, Yale University, New Haven, Conn. Three production databases were the basis of the 2 group's research: The Clinical Data Repository at Columbia-Presbyterian Medical Center (CPMC), the Adaptable Clinical Trials DataBase (ACT/DB), and SENSELAB. 8 10 3 4 6 7 11 7 4 5 The principles and design issues involved in these databases are the focus of the remainder of this paper. I will not go into details about the specific implementations of these systems, rather I will present techniques involved in the design of generic database systems. For design details about the 3 database systems the reader is encouraged to consult the references. Entity-Attribute-Value Design In conventional database design, each parameter of interest is represented in a separate column in a table. As new kinds of data need to be managed, the number of columns and/or tables needs to grow. Table 1 Table 2 Table 2 The EAV design has several advantages: Flexibility: Storage: Efficient entity-centered queries: The EAV design has, however, some drawbacks: Data display: Less-efficient attribute-centered queries: J Constraint checking: Metadata logical schema Table 2 Evolution of the EAV Model In the following sections, I give examples of different EAV schemas going from the most-simple, least-flexible to the most-advanced, most-flexible schema. The term "simple" is not to be interpreted as inadequate. The simple solution may be the right solution for a specific task. The examples reflect the systems described in the literature but are simplified for pedagogical reasons. A Simple EAV Model Figure 1 Table 3 Figure 1 Table 3 Table 1 The entity part of the Data table is defined by the combination of patientID and date. The attributeID column holds a reference to the Attribute table, which defines the name and type of available attributes. In a real-world production database there would probably be another table to hold the definition of data types. Table 3 Table 4 12 Table 5 This approach is used in CPMC, ACT/DB, and SENSELAB. For simplicity I chose to show only 1 data table in the illustrations. The modeling of patient demographic data in a separate conventional table rather than in the EAV table is deliberate (although not necessary). For a schema that is not expected to change often, as is the case with patient demographics, the advantages of an EAV layout do not exceed its disadvantages; and conventional tables and EAV tables can coexist happily together. Furthermore, this design makes it easy to model the one-to-many relation between patient and clinical events. Relations between entities in an EAV table are complicated to model in the simple EAV design. In an electronic patient-record system, for example, it should be possible to record relationships between clinical events (eg, infection leads to a course of penicillin or myocardial infarction leads to death). The enhancement of the EAV design to handle complex relationships between classes will be described later with the EAV/CR schema. For a simple application intended mainly for data entry, the simple EAV schema may suffice. With the need for a more-advanced user interface for data-display and input purposes, however, some means of grouping attributes becomes necessary. With the simple EAV schema, grouping attributes together on display forms may be done only by entity (patientID and date) or attribute. The application has no way of telling how EAV data records are related and should be displayed together—eg, multiple values from the same blood chemistry panel. Enhancing the EAV Model Figure 2 A group table and a form table have been added to the metadata schema. Attributes may now be grouped and attribute groups may be part of forms. To the entity part of the Data table a new field, formID, has been added telling the application to which form a data record belongs. Now any medical event recorded in the Data table belongs to a form and then may be displayed together with all the other attributes on that form. Furthermore, this design facilitates reuse of attribute groups on different forms. Depending on the domain being modeled and the requirements of the users, other metadata schemas may be suitable. The simple and the enhanced EAV schemas discussed above are examples of the use of generic EAV tables in clinical database applications. Although to some degree generic, the proposed schemas will need adjustment to the actual domain in question. To achieve total domain-independence more refined models must be created. An Object-oriented Approach to EAV Modeling 5 Figure 3 Table 6 inheritance composition is a is a has a Thus, with this simple layout with (conceptually) just 5 tables, any real-world object can be recorded. Furthermore, objects may be part of other objects; and objects may be related through inheritance. Ad hoc relations between objects (eg, penicillin leads to rash) may be recorded as objects themselves. For this purpose, a class, ObjectRelation, could be defined with 2 attributes, objectID and relatedObjectID. More descriptive attributes may be added to this class if required—eg, causality. Obviously, considerable up-front programming is required to drive an ergonomic user interface for the EAV/CR model in a real-life production environment. On the other hand, this is a one-time-only job. Another drawback of the EAV/CR design is that the system administrator must have a solid understanding of the object-oriented framework in order to design useful classes. An EAV/CR database is therefore hardly an end-user tool for the average clinician or researcher. As always, flexibility comes with a price. Querying EAV Data From a database perspective, querying EAV data is not different from querying conventional data. As mentioned earlier, however, in an EAV database, the physical layout differs greatly from the logical layout, and the user generally wants to see data displayed in a conventional format. Table 1 Jens SELECT * FROM table1 WHERE name LIKE 'Jens%' AND dob < '1970'; Table 2 SELECT d1.patientID AS patientID, d1.value AS name, d2.value AS dob FROM table2 AS d1 INNER JOIN table2 AS d2 USING (patientID) WHERE d1.attribute='name' AND d1.value LIKE 'Jens%' AND d2.attribute = 'dob' AND d2.value < '1970'; The same result may be obtained in several ways, but in any case the query must include set operations (INTERSECT) or as in this example a self join for each attribute. (A self join is a join of a table with itself.) Aside from being complex and out of reach for most end users, these operations are far slower than simple select statements. I did an experiment using data for one million patients described by 3 attributes: name, date of birth, and gender. These facts were duplicated in a conventional table and in an EAV table in a MySQL database. Three queries were performed on each table with 1, 2, and 3 attributes respectively. Execution time was approximately 2 seconds for the conventional table irrespective of the number of attributes. For the EAV table execution time was 7, 14, and 24 seconds respectively. Thus execution time increases linearly with the number of rows (1 million in the conventional table and 3 million in the EAV table) and—in the EAV table—with the number of joins involved in a query. In the conventional table, however, the number of joins did not affect query time. Some strategies have been suggested to deal with this problem: There may not be a problem. Attribute-centered queries are important for research questions; their performance is not critical for the care of individual patients. If the need for cross-patient data is infrequent the advantages of EAV design probably exceeds the disadvantages. 6 6 3 10 In summary, querying EAV data is a more complex task than querying data in a conventional layout; and attribute-centered queries are less efficient with EAV data compared to conventional data. Graphical User Interface The challenge for the user-interface designer of an EAV database is to display data and to let the user manipulate data simulating a conventional layout irrespective of the physical layout—in other words: to bridge the physical and the logical schemas. The World Wide Web offers an opportunity to simplify database deployment and maintenance. In a typical Web database application, the user's browser requests data from a remote Web server, which sends the request to a database server. After receiving data back from the database server, the Web server formats it into a Web page and sends it to the client browser. There are several advantages of Web deployment: Problems of form deployment are eliminated since all forms reside on the Web server. Deployment costs are reduced because Web browsers are available free. Also, hardware costs are reduced since browsers usually have smaller hardware requirements than desktop database-management systems do. The form-rendering model of Web pages is simpler and smarter than that of traditional software platforms. Objects on a Web page can be automatically reformatted when the browser win­dow is resized or the user changes the font size. Traditional software developers must put much effort into physical screen size issues. This is not necessary with Web forms. Web browsers use clever caching algorithms. That means that when the browser visits a particular page, its contents are cached on the client. On revisit, only components that have changed are downloaded again. This reduces download time and network load. For these reasons, Web deployment is becoming more and more popular for multi-user applications. However, Web database applications are significantly more complex to develop than traditional database applications for several reasons: Web-development tools are less mature than tools for traditional software development; and development of Web database applications still requires much "coding-by-hand." As an example, simple errors such as misspelled variable names, which would be trapped at edit or compile time in a traditional environment, will not be detected until runtime in a Web application. Browser-server communication is inherently stateless; when the server has sent a Web page to the client, it "forgets" about the client. Tracking information (eg, user authentication) through several Web pages therefore involves extra programming. To maintain information, the developer must store data either in (hidden) form fields on Web pages or in session variables, which can be accessed as long as the session lasts. Both approaches complicate development and may compromise security because other users (or processes) may gain access to these data intentionally or accidentally. Designing Web forms requires much more programming than does designing forms in traditional client-server environments. Web form fields are typeless and input masks for formatting user inputs are not inherent parts of Web forms. (In typeless fields the user may accidentally enter numbers in text-only fields or accidentally enter text in numbers-only fields.) This puts pressure on the programmer to put much effort into both client-side and server-side data validation. In a traditional environment, form fields may be typed; thus, eg, the programmer does not need to worry about users entering letters in number fields or invalid dates in date fields. In a Web form, all validation procedures must be hand coded. Finally, population of select boxes (drop-down menus) and radio buttons (option buttons) with dynamic data is usually much easier in a traditional environment than on a Web form. 4 WebEAV makes extensive use of client-side validation of data. Standard validation code in the form of JavaScript is built into the Web page. Validation relies on the use of form field events (eg, OnChange, OnFocus, and OnBlur) and metadata for the attributes in the form (eg, data type, maximum and minimum bounds, and non-null requirements). Discussion Based on searching the literature, it appeared that the Entity-Attribute-Value model is useful for generic design of clinical databases. The most advanced model uses an object-oriented approach and gives tremendous flexibility, allowing the designer to model any type of concept and any relation between concepts in the domain of interest without ever having to worry about changing the table layout or maintaining the user interface. With the ever changing and evolving domain of clinical information, generic design is of special interest for clinical databases, because changes to the logical schema will not affect the physical schema. However, database designers from other areas (eg, biology or literature) may also find the EAV approach useful. 13 14 14 15 Pros and Cons of EAV Design The advantages of generic design are obvious. The disadvantages, however, may be less obvious and depend on the objectives of the specific application in question. From a performance point of view, the strength of the EAV design lies in effective entity-centered queries since no joins are necessary to retrieve all facts about entities (eg, patients or medical events) as would be the case in a conventional design with facts spread over hundreds of tables. The drawback lies in inefficient attribute-centered queries, since a (self) join is necessary for each attribute that is requested. Performance of EAV tables may not be an issue for small databases, but for large clinical repositories with hundreds of concurrent users, query time may be a critical factor. Also, the need for complex attribute-centered data retrieval differs greatly between applications. An electronic patient-record system, for example, is usually aimed at displaying patient-centered (ie, entity-centered) facts, while a research database usually must have some means of aggregating data across a large number of patients. In the latter, however, query efficiency may not be a problem, since data summaries are retrieved only intermittently and may be stored on separate hardware. These issues warrant careful design of the database schema and cautious decisions about when to use conventional tables in place of generic EAV tables. As a rule of thumb, conventional table design is appropriate for entities whose schemas are not expected to change often (eg, people or institutions). Metadata Preserves Information Figure 1 Figure 2 Figure 3 It appears that metadata schemas themselves may be more or less generic depending on how closely related they are to the actual domain being modeled. The more specific the metadata schema is, the less flexible it will be. On the other hand, a specific metadata schema will require less programming to drive the user interface than a highly generic one. To summarize this part, the choice of model depends on the domain and the requirements for flexibility. The object-oriented approach is by far the most flexible solution and in many ways an elegant solution. On the other hand, the complexity introduced by this model may not be justified unless the domain requires the fine-grained control over objects and relations. A simple model may well be the right solution for a simple job. Databases and Objects Much effort has been put into generalizing clinical databases. The most flexible and generic models take an object-oriented approach to data modeling the mapping of objects to tables in a relational database. There is no doubt that object-oriented design is "hot" in the medical area. But porting of object-oriented generic databases from traditional relational databases to produce object-oriented database management systems (OODBMSs) does not seem to be just around the corner. One reason for this may of course be that object-oriented database management systems are still lagging behind relational database management systems with respect to efficiency and availability, although extensive research is going on in this field. Furthermore, object orientation is still a new concept to most clinicians who design databases. But even with modest skills in an object-oriented programming language such as Java, the similarities between object-oriented programming and object-oriented data management seem striking. 16 object-relational persistent object-oriented The former approach has similarities to the approaches described in this project in that these build upon conventional relational database management systems. The SENSELAB database allows for composition and inheritance, and CPMC has explored the extension of SQL to facilitate attribute-centered querying EAV data. generic The object-oriented paradigm ("everything is an object") is a means of describing real-world concepts, and objects may be easier to understand for a clinician than complex relationship sets in a relational database. One could say that object-oriented design brings together the logical and the physical schema. Even if this may not be completely true, the user should not have to worry about how to design tables for storing of objects. The database will take care of this. Object-oriented languages handle complex attributes and inheritance much more elegantly than do even the most cleverly-designed relational database. When referring to an object in an object-oriented programming language, the object's fields and methods are available to the user immediately, through the object's interface. To mimic an object in a relational database, the database must be queried for all attributes of interest, and each value must be accessed separately. Objects may contain methods. For example, a person object may contain a print() method, which outputs all information related to the objects in a suitable format. The client programmer, who builds the user interface, does not have to worry how this information is gathered. This programmer only needs to grab the information and present it in a nice layout on a form. Furthermore, different subtypes of the person class, eg, patient or doctor, may have different implementations of the print() method. This is an example of polymorphism and is one of the most powerful features of object-oriented programming languages. Classes may be reused. If a class has been designed, it may be reused in other applications; and if a class is redesigned (eg, to improve execution speed) the client programmer does not need to know this, as long as the class' interface is unchanged. encapsulation polymorphism It is obvious that these (and other) facilities of object-oriented programming languages would be of immense value in the creation of generic clinical databases. It is, however, important to realize that a database management system, whether object-oriented or not, comprises much more than a programming and query language—important issues being storage management, transaction management and concurrency control—and these issues are still under development in object-oriented database management systems. (Concurrency control involves locking parts of the database to prevent unintentional overwriting of data.) Conclusions The objective of generic database design is to provide a robust physical database schema that does not need to change as the domain evolves. Generic databases are of special interest for clinical information systems, and several approaches to generic design have been exercised. They have in common the use of Entity-Attribute-Value tables for storing data and a number of metadata tables to describe the semantics and the relations between data. An object-oriented approach to generic modeling of metadata is by far the most flexible and domain-independent approach. However, the overhead in taking this approach may not be justified for less-advanced applications. Further studies regarding the implementation of object-oriented database management systems for the purpose of generic clinical databases are suggested.