Desired Features - Drawing on experience with many data modeling tools, the intent was a feature set that cherry-picked the best from each. By collecting only what was needed, a superior tool could be built.
Save Money - Data Excelerator was created to absorb as many functions of existing tools as possible and to price it low enough (100 USD) to make the purchase decision easy.
Deploy Widely - By creating a tool that does what is necessary and is cheap enough to deploy widely in an organization, the hope is to re-emphasize the importance of good data design.
For instance, if all the members of a scrum team have a copy, then the data model becomes a part of the agile democratic process. There can, and should, be an overarching data architecture. But the Enterprise Data Architect can monitor the details with less work than doing it all themselves.
There is an old joke in programming. The manager stands at the door of his shop and says to his crew, "You guys start coding. I'll go find out what they want."
This little story is an exemplar of how the IT discipline treats the design part of software development. It is an overhead to be skipped while someone else figures out the requirements. A lot of this comes from programmers' ability to code their way around problems. If a mistake is made in the design (or the lack of one), then they can simply fix it by coding a solution. In some ways, the advances in people's ability to create software have not resulted in a productivity increase, simply because they are creating the wrong software half the time.
The ability to code and re-code directly is why NoSQL databases are so popular with programmers. They have complete control over what is stored, and they do not have to interact with a DBA to get changes made to the structure. The speed of these databases is also a factor, but it is the lack of process friction that makes them so attractive to developers.
The de-emphasis on design has also come about because development teams are not involved in it; only senior or specialized staff end up being part of the design process.
Agile organizations are a little better and a little worse. The design still needs to be done before the formal start of the scrum process, but it should be the team that does the design rather than the senior staff.
So there needs to be a tool that is inexpensive enough that many people can be part of the design process. If all developers can do database design, and then negotiate democratically what the final result should look like, they will do the thinking that makes a solution more resilient to change. Hence, there will likely be a better result, just as the Agile process achieves with code.
Obviously, one version of a diagram will need to be the master, and there is of course the overhead of absorbing changes from many different authors. But that is what tools like GitHub do well. Merging code is the same process as merging changes to an XML file of a data model (assuming the data model is stored in a clear text format like that).
Many database engines offer data modeling tools as part of their software (Toad, SQL Developer, MySQL Workbench, Oracle Designer, etc.). They can do forward or reverse engineering, and they do it by manipulating DDL directly: they can read the database catalogue and generate change DDL.
However, in general, these tools exist to help lock you into the back-end. They tend not to store metadata about the design process, except perhaps in the database that they are attached to. They are not really design tools, but documentation and DBA tools.
The tight coupling of a data model to the database is one of the reasons Data Excelerator was designed with database independence. With design-time data types that can later be mapped to the target databases, the design process can begin before a database engine is chosen. It is also then easy to change databases without affecting the design.
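As a purely illustrative sketch of the idea (the design-time type names and mappings below are hypothetical examples, not Data Excelerator's actual internal mapping), a design-time type is chosen first and only later resolved to the DDL type of whichever engine is selected:

```python
# Illustrative only: the design records abstract types; the target engine is
# chosen later, so changing databases does not change the design itself.
DESIGN_TYPE_MAP = {
    "String(50)": {"Oracle": "VARCHAR2(50)", "SQL Server": "NVARCHAR(50)", "MySQL": "VARCHAR(50)"},
    "Integer":    {"Oracle": "NUMBER(10)",   "SQL Server": "INT",          "MySQL": "INT"},
    "DateTime":   {"Oracle": "DATE",         "SQL Server": "DATETIME2",    "MySQL": "DATETIME"},
}

def physical_type(design_type: str, target_db: str) -> str:
    """Resolve a design-time type to the DDL type for the chosen target engine."""
    return DESIGN_TYPE_MAP[design_type][target_db]

print(physical_type("String(50)", "Oracle"))   # VARCHAR2(50)
```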
The database design must be the master version as the development process continues. If DBAs are allowed to change the database directly without the changes being reflected in the model, then the documentation becomes out of date and the change DDL will not necessarily work as expected.
Most data modeling tools support this unnecessary distinction between logical and physical models. The requirement for this feature dates back to the early days of mainframe computers. The large organizations that had large computers had correspondingly large departments that performed different functions. The data modeling function was usually done by a Business Analysis group as part of the SDLC design process, which passed the design off to the Operations Database Administrators to implement. The DBAs would add single-field physical keys to replace the logical keys defined by the business, as well as make other changes to address performance and storage concerns.
Another old reason for the distinction was to support a different field order between the logical and physical models. In the old days, when disk was expensive, there were rules of thumb around how to sequence fields for best use of disk space and extents. With current low storage costs, these concerns are no longer valid.
Given that the old patterns above are no longer necessary, a modern data modeler will, as a standard best practice, create a table with a meaningless numeric key (quite often an identity key) as its primary key. Any logical keys will be defined as indexes (unique or not, as appropriate) to facilitate querying.
Essentially, making the logical model equal to the physical model is the modern design standard.
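As a minimal sketch of this pattern (the table and column names are hypothetical, and the DDL is shown through Python's built-in sqlite3 module, where an INTEGER PRIMARY KEY behaves as an auto-generated identity):

```python
import sqlite3

# A meaningless numeric surrogate key is the primary key, while the logical
# (business) key is enforced by a unique index instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        CustomerId  INTEGER PRIMARY KEY,   -- surrogate identity key, no business meaning
        CustomerCd  TEXT    NOT NULL,      -- logical key defined by the business
        FirstNm     TEXT,
        LastNm      TEXT
    );
    -- The logical key becomes a unique index rather than the primary key.
    CREATE UNIQUE INDEX ux_Customer_CustomerCd ON Customer (CustomerCd);
""")
```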
At the data level, referential integrity is the requirement that tables conceptually linked by a data field actually implement that key relationship when data is stored, so that an entry in a child table genuinely relates to an entry in the parent table.
To ensure that rows are identified correctly, the parent table has to have a primary key value that is unique across the table. A copy of this value is stored on the child table that relates to it, which is where the nomenclature of a "relational" database comes from. So, at a minimum, each table that has a child entity needs a unique key.
Single-row updates also require a unique key. For very high-volume data, the overhead of maintaining a unique key may be too high to justify; if the table is a leaf node of a data hierarchy, you can get away without one. But if there is a requirement to update a single row at a time, then there is no alternative.
Any child table that has a copy of the relational key as a foreign key field should be able to assume that the key value maps correctly to one in the parent table. If it does not, then the row in the child table is known as an "orphan". The correctness of this data requirement is known as referential integrity.
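As a minimal sketch (hypothetical Customer and Orders tables, using Python's sqlite3 module), an orphan is simply a child row whose key value has no match in the parent, and an outer join is enough to expose it:

```python
import sqlite3

# Hypothetical parent/child tables: Orders.CustomerId is a copy of Customer.CustomerId.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, CustomerNm TEXT);
    CREATE TABLE Orders   (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER, OrderDt TEXT);
    INSERT INTO Customer VALUES (1, 'Acme');
    INSERT INTO Orders   VALUES (10, 1,  '2024-01-01');  -- relates correctly to its parent
    INSERT INTO Orders   VALUES (11, 99, '2024-01-02');  -- no Customer 99: an orphan
""")

# An outer join exposes child rows whose key value has no match in the parent.
orphans = conn.execute("""
    SELECT o.OrderId
    FROM Orders o LEFT JOIN Customer c ON c.CustomerId = o.CustomerId
    WHERE c.CustomerId IS NULL
""").fetchall()
print(orphans)   # [(11,)]
```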
In the early days of relational databases, IBM's DB2 was the only game in town. This was the world of big iron and correspondingly big departments of "wizards" who ministered to it. The Oracle world is still similar today.
Two of the major sets of wizards were the Blue programmers and the Red DBAs. The Blue team wrote code to interact with the database and followed their processes to promote code to testing, production, etc. Because they came from a hierarchical and network database environment, they made mistakes and sometimes forgot to delete child records (a function they got for free in the older database types).
The Red team defended the integrity of the database as they saw it. This was still during the days when disk space was expensive. So orphan records were not acceptable since they took up valuable disk space. Rather than doing expensive outer joins to find these orphans and clean up the unused disk space, they came up with Referential Integrity to ensure that the Blue team did their job to the expected degree of rigour.
Nowadays, disk is cheap and so there is less value in preventing orphans in the database. Also, better testing methodologies and modern programming practices make it less likely for the Blue teams to forget to ensure clean data in the database. There are also times when simply orphaning a child data row implements a logical delete, which is less expensive than deleting single rows from a large volume table.
With DDL, the database engine is told which fields relate two tables together. With this information, the database can ensure that those fields are filled in with correct values. What is wrong with that?
RI forces a programmer to respect the need for a row in a child table to be linked to a row in a parent table via its foreign key. The database engine does this by returning an error to the program if a row is inserted or updated with an incorrect value. The programmer can trap that error and respond to it. This behaviour helps ensure good data.
RI can also automate deletion: when a row is deleted from the parent table, the engine can cascade the delete to the child records so that there is no orphan data.
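A minimal sketch of both behaviours, using Python's sqlite3 module with hypothetical Customer and Orders tables (note that SQLite only enforces foreign keys when the pragma is switched on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FK constraints only when asked

conn.executescript("""
    CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, CustomerNm TEXT);
    CREATE TABLE Orders (
        OrderId    INTEGER PRIMARY KEY,
        CustomerId INTEGER NOT NULL
                   REFERENCES Customer (CustomerId)  -- the DDL declares which fields relate the tables
                   ON DELETE CASCADE,                -- deletes on the parent propagate to children
        OrderDt    TEXT
    );
    INSERT INTO Customer VALUES (1, 'Acme');
    INSERT INTO Orders   VALUES (10, 1, '2024-01-01');
""")

# The engine returns an error for a foreign key value with no matching parent;
# the program traps that error and responds to it.
try:
    conn.execute("INSERT INTO Orders VALUES (11, 99, '2024-01-02')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)   # FOREIGN KEY constraint failed

# Deleting the parent cascades to its children, so no orphan rows remain.
conn.execute("DELETE FROM Customer WHERE CustomerId = 1")
print(conn.execute("SELECT COUNT(*) FROM Orders").fetchone())   # (0,)
```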
COST - The negative impact of RI is that the database engine is now engaged in ensuring that each INSERT or UPDATE on a table with an FK constraint supplies a value that links back to the parent. Not only is processor power expended on doing this check, but IO is done to read the parent table's index key. So, there is a negative speed impact on the INSERT operation when RI is used.
Similarly, a DELETE operation on a parent table will result in extra IO by the database engine, as well as processor overhead, to perform these tasks. While this ensures data consistency, it is not good practice for a developer to depend on this feature when implementing a deletion step.
Ironically, the only current value of Referential Integrity is really for diagramming. With RI relations defined, the links between tables are documented, so when reverse-engineering a database catalogue into a diagram, the links can be painted accurately.
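As a minimal sketch of what that reverse engineering looks like (hypothetical tables, reading SQLite's catalogue pragma from Python; other engines expose the same information through views such as information_schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, CustomerNm TEXT);
    CREATE TABLE Orders (
        OrderId    INTEGER PRIMARY KEY,
        CustomerId INTEGER REFERENCES Customer (CustomerId)
    );
""")

# A diagramming tool can recover the declared relationships from the catalogue
# and paint the link between the two tables accurately.
for row in conn.execute("PRAGMA foreign_key_list(Orders)"):
    # row layout: (id, seq, parent_table, from_column, to_column, on_update, on_delete, match)
    print(f"Orders.{row[3]} -> {row[2]}.{row[4]}")   # Orders.CustomerId -> Customer.CustomerId
```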
So each database implementation should question whether the overhead is worth it. Perhaps the lack of RI is one of the reasons that NoSQL databases are so much faster?
NoSQL is not a good characterization of this type of database; SQL is just an interface language. These databases simply have an API, usually consistent across languages, which can be used to retrieve data. A better label would be "no structure" database. What is usually stored is a "document", typically field name=value pairs encoded in JSON or XML.
One of the reasons that this kind of database exists is the work involved in defining a table layout with DDL in more traditional relational databases. The Data Modeler or DBA of an organization would insist on proper definition of field names, data types, and lengths as part of the design process. Then the developer would wait for the database to be defined and proper security access to be granted. If there was a change in requirements (which is normal and frequent), then there would be a similar process to change the data definition in the database. All of these processes took time, which frustrated the developers. The DBA function became its own worst enemy in terms of ensuring good structure.
The natural result of the slow process was for developers to turn to databases that did not require so much advance design and implementation time, such as MongoDB, Cassandra, Hadoop, and many others. (There are lots of other things that these databases do well, but this discussion focuses on the notion of structure.) With a JSON document storage structure, developers can call a field whatever they want, increase the size of a string field whenever they want, and even change a data type whenever they want.
However, the flexibility of the NoSQL database means that it is easier to make a mistake. Much of the rigidity of the relational database environment, when it comes to structure, exists to prevent mistakes: preventing duplicate field names, ensuring that data values are valid, that keys are unique, and so on. These are all standard in RDBMS engines.
So, to ensure high-quality data, the structure must be imposed from the outside, by the developer, instead of by the database engine. A data design helps the developer use standard data nomenclature. For instance, all data stores that contain the concept of a person's first name will call it "FirstNm". In a NoSQL world, the developer uses the design to ensure that the JSON documents use the consistent naming the design implies.
The other major requirement would be to implement keys in the JSON structure as defined in the design. Without a primary or foreign key stored in the document, the data is essentially useless, since it cannot be "related" to any other data entities in the organization.
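As a minimal, hypothetical sketch (the field names follow the naming standard mentioned above; the document layout is illustrative, not prescribed by any particular engine):

```python
import json

# A hypothetical order document: field names follow the design's standard
# nomenclature (e.g. FirstNm), and the primary and foreign keys defined in the
# design are stored explicitly so the document can be related to other entities.
order_doc = {
    "OrderId": 10,            # primary key from the data design
    "CustomerId": 1,          # foreign key relating this document to a Customer
    "OrderDt": "2024-01-01",
    "FirstNm": "Jane",        # consistent name for a person's first name
    "LastNm": "Smith",
}
print(json.dumps(order_doc, indent=2))
```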
So, a proper design and implementation process can ensure that referential integrity exists in a NoSQL world without the negative tradeoffs made in relational engines.
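A minimal sketch of that idea, with plain Python dictionaries standing in for the document store's collections (a real implementation would query the NoSQL engine instead):

```python
# Referential integrity enforced by the application instead of the engine:
# reject any document whose foreign key has no matching parent, so no orphans.
customers = {1: {"CustomerId": 1, "FirstNm": "Jane", "LastNm": "Smith"}}
orders = {}

def insert_order(order: dict) -> None:
    """Reject the document if its foreign key has no matching parent."""
    if order["CustomerId"] not in customers:
        raise ValueError(f"orphan rejected: no Customer {order['CustomerId']}")
    orders[order["OrderId"]] = order

insert_order({"OrderId": 10, "CustomerId": 1, "OrderDt": "2024-01-01"})   # accepted
try:
    insert_order({"OrderId": 11, "CustomerId": 99, "OrderDt": "2024-01-02"})
except ValueError as err:
    print(err)   # orphan rejected: no Customer 99
```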
My name is Robert Nobel. I have been data modeling since 1987 when I used a tool called Excelerator from Index Technology under MS DOS / Windows 3.1. This was primarily for Mainframe databases like Model204 and DB2.
When I moved to the PC, I moved to ERWin V1.1. (I think I had the third license in Canada!) Since then I have been using ERWin for all my data modeling needs. I have also used Systems Architect, Oracle Designer, and many others as well. However, I missed some of the features of Excelerator and wanted to cherry-pick features from it and other tools.
I investigated open source data modeling tools and found a couple (Mogwai, Model Sphere). However, they didn't meet all of my requirements, and the code bases were too complex to change to accommodate them.
However, both of these tools showed what could be done using Java, and that I only needed to start putting it together.
Data Excelerator is the result. It was coded from scratch, with no libraries other than Apache POI for the Excel Data Dictionary interface. Yes, I used design patterns like MVC, Singleton, and Proxy.
Hope you like it and find it useful.
Feel free to agree or disagree with anything I have written.
Please send me those comments at arguments@dataExcelerator.com