Data rooms are becoming increasingly important in the real estate industry. They provide protected areas in which a variety of relevant documents are made available to interested parties. Besides supporting purchase and sales processes, they are used mainly in larger construction projects.

The structures and index designations of data rooms are not yet standardized internationally. Data room indices are created using different approaches and therefore differ in both depth of detail and range of topics. In practice, rules already exist for structuring documentation within individual life cycle phases and for transferring data between these phases. Because all documentation must be transferable when the life cycle phase or the participant changes, the information must always be clearly identified and structured so that it can be protected, accessed and administered at all times. This poses a challenge for companies because documents undergo several rounds of restructuring during their life cycle, which are costly and always carry the risk of data loss. The goal of current research is therefore seamless storage and a permanent, unambiguous classification of documents across the individual life cycle phases.

In text classification, machine learning offers considerable potential for reducing workload, accelerating processes and improving quality. In data rooms, machine learning (in particular document classification) is used to automatically classify documents that are already in the data room or are to be imported, and to assign them to a suitable index point. Each document is assigned to the class to which it most probably belongs (e.g., based on word frequency).
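As an illustration of assigning a document to its most probable class based on word frequency, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is an assumption chosen for brevity, not necessarily the algorithm used in the paper; the class labels, the training texts and the function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplistic whitespace tokenizer; real documents need more careful handling.
    return text.lower().split()


def train(labeled_docs):
    """Count documents per class and word frequencies per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the most probable class and the log score of every class."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    scores = {}
    for label, n_docs in class_doc_counts.items():
        log_prob = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            count = word_counts[label][token]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = log_prob
    return max(scores, key=scores.get), scores


# Hypothetical training data with two invented index points.
training = [
    ("lease agreement tenant rent term", "lease"),
    ("rent increase notice tenant lease", "lease"),
    ("building permit approval authority construction", "permit"),
    ("construction permit application authority", "permit"),
]
model = train(training)
label, scores = classify("tenant rent notice", model)
```

In this toy setup the query shares its words only with the "lease" examples, so that class receives the highest score. With more than a thousand classes, as in the model described here, far more training data per class and a more robust tokenizer would be needed.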

The success of machine learning for document classification depends on the quality of both the document classes and the training data. When defining the document classes, it must be ensured, on the one hand, that they do not overlap in content, so that documents can be clearly allocated thematically. On the other hand, it must be possible to accommodate documents that appear later and to scale the model as requirements change. For the training and test sets, as well as for the documents to be analyzed later, the quality and readability of the documents are also decisive factors. To analyze the documents effectively, their content must also be standardized, and it must be possible to remove non-relevant content in advance.
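Standardizing content and removing non-relevant content in advance could look roughly like the sketch below. The specific steps (Unicode normalization, rejoining words hyphenated across line breaks, dropping page-number lines, stripping punctuation) and the name `normalize_text` are assumptions made for illustration; the paper does not specify its preprocessing pipeline at this level.

```python
import re
import unicodedata

# Lines such as "3", "Page 3" or "Page 3 of 12" are treated as page numbering.
# Caveat: a line consisting only of a number is also dropped by this pattern.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(/|of)\s*\d+)?\s*$", re.IGNORECASE)


def normalize_text(raw):
    # Unify compatibility characters (ligatures, full-width forms, etc.).
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split by a hyphen at a line break (common in OCR output).
    # Caveat: this also merges genuine compounds that happen to break there.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop empty lines and lines that look like page numbers.
    lines = [
        line for line in text.splitlines()
        if line.strip() and not PAGE_NUMBER.match(line)
    ]
    text = " ".join(lines).lower()
    # Replace punctuation with spaces, then collapse repeated whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Lease Agree-\nment\nPage 3 of 12\n\nRent: monthly"
cleaned = normalize_text(sample)
```

For the sample above, the page-number line is removed and the split word is rejoined, leaving `lease agreement rent monthly`. Which content counts as non-relevant depends on the document type, so the filter rules would have to be adapted per inventory.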

Based on an empirical analysis of 8,965 digital documents from fourteen properties belonging to eight different owners, the paper presents a model with more than 1,300 document classes as a basis for the automated structuring and migration of documents over the real estate life cycle. To validate these classes, machine learning algorithms were trained and analyzed to determine how, and under which conditions, the highest possible classification accuracy can be achieved. Stemmer and stop word lists were developed specifically for these analyses. Because these lists are aligned to terms used in the real estate industry, they further increase the accuracy of machine learning classification.
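The role of domain-specific stop word and stemmer lists can be sketched as follows. The entries in `STOP_WORDS` and `STEM_OVERRIDES`, the suffix rules and the function names are all hypothetical placeholders; the lists actually developed for the paper are not reproduced here.

```python
# Hypothetical stop word list: general function words only.
# A domain-specific list would add frequent but uninformative industry terms.
STOP_WORDS = {"the", "of", "and", "for", "to", "a", "in", "is"}

# Hypothetical stemmer list: maps domain terms directly to a common stem,
# taking precedence over the generic suffix rules below.
STEM_OVERRIDES = {
    "leasing": "lease",
    "leases": "lease",
    "mortgages": "mortgage",
}

# Generic fallback suffixes, checked from longest to shortest.
SUFFIXES = ("ings", "ing", "es", "s")


def stem(token):
    if token in STEM_OVERRIDES:
        return STEM_OVERRIDES[token]
    for suffix in SUFFIXES:
        # Keep at least four characters so short words are not mangled.
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop stop words, and reduce remaining tokens to stems."""
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOP_WORDS]


features = preprocess("The leasing of mortgages and tenants")
```

Here `features` becomes `['lease', 'mortgage', 'tenant']`: the override list handles terms that the generic suffix rules would stem poorly (for example, "mortgages" would otherwise become "mortgag"), which is the kind of benefit a domain-aligned list is meant to provide.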

The paper also shows which aspects must be taken into account at an early stage when digitizing extensive data and document inventories, since automation using machine learning can only be as good as the quality, legibility and interpretability of the data allow.