Real estate is increasingly becoming an asset class subject to the same requirements as other capital investments. As a consequence, real estate portfolios have gained strategic importance for many businesses. The resulting large volumes of documentation and information require a structured database system in which information and documents remain permanently transparent, complete, and findable. Portfolio and operating documentation must be reliably and consistently available to a variety of actors over a period of decades.

To protect, administer, and access documents effectively at all times, it is necessary to establish a unique structure and identification system for the information. In practice, however, a variety of standards exist for document structures in particular lifecycle phases and for the transmission of data between specific phases. Documents are consequently restructured repeatedly throughout their lifecycle, a process that is expensive and entails a risk of data loss.

This paper describes an approach for unifying the existing document structure standards and making them compatible across the property's lifecycle by means of unique document classes. The goal is a stable, unique document classification, together with the capacity to automatically classify relevant (and, in particular, unstructured) documents. In the course of digitalization or migration, documents can then be directly associated with a document class, which ensures that they keep a single unique classification throughout their lifecycle; users can display them in restructured forms for specific use cases at any time without incurring additional costs.
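
The idea of one stable document class with several use-case-specific presentations can be illustrated with a minimal sketch. The class names, view names, and folder labels below are hypothetical and chosen only for illustration; they are not taken from any of the standards the paper refers to.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    doc_class: str  # assigned once and never changed afterwards


# Hypothetical views: each maps a document class to a folder
# in one use-case-specific structure.
VIEWS = {
    "operations": {
        "floor_plan": "Plans",
        "lease_contract": "Contracts",
        "maintenance_report": "Maintenance",
    },
    "transaction": {
        "floor_plan": "Technical due diligence",
        "lease_contract": "Legal",
        "maintenance_report": "Technical due diligence",
    },
}


def render_view(documents, view_name):
    """Group document IDs into the folders of the chosen view.

    The documents themselves are not modified; only the
    presentation changes.
    """
    mapping = VIEWS[view_name]
    folders = defaultdict(list)
    for doc in documents:
        folder = mapping.get(doc.doc_class, "Unassigned")
        folders[folder].append(doc.doc_id)
    return dict(folders)


# Made-up example documents.
docs = [
    Document("D1", "floor_plan"),
    Document("D2", "lease_contract"),
    Document("D3", "maintenance_report"),
]
operations_view = render_view(docs, "operations")
transaction_view = render_view(docs, "transaction")
```

Because each view is only a mapping from class to folder, adding a further structure for another lifecycle phase would mean adding one more mapping rather than restructuring the stored documents.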

To determine to what extent this process can be automated with machine learning, a range of algorithms was applied to real building documentation, analyzed, tested for reliability, and optimized for building-specific data.
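
To make the classification step concrete, the following is a minimal sketch of one possible text classifier, a multinomial naive Bayes model written with the Python standard library only. The abstract does not name the algorithms that were evaluated, so the choice of naive Bayes is an assumption made for illustration, and the training texts and class names are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_doc_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        # Tokens never seen during training carry no information.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data; real building documentation would be far larger.
train_texts = [
    "lease agreement between landlord and tenant rent term",
    "tenant rent payment lease termination clause",
    "maintenance report elevator inspection service",
    "inspection of heating system maintenance service report",
]
train_labels = [
    "lease_contract",
    "lease_contract",
    "maintenance_report",
    "maintenance_report",
]

model = NaiveBayesClassifier().fit(train_texts, train_labels)
predicted = model.predict("annual elevator maintenance inspection")
```

A sketch like this only works on documents whose text can be extracted, which is one reason why not every digitalized document lends itself to automated classification.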

The analysis demonstrated that not all digitalized documents are directly suited to automated classification. The paper therefore illustrates the associated problems and presents detailed recommendations for facilitating automated classification and migration using machine learning, so that major errors can be avoided from the very beginning of the digitalization process.