Appraisal documents contain valuable information for analytics and decision-making at various steps of the mortgage process. Extracting and standardizing the data embedded in these documents is the first step, but it requires automation to avoid manual data entry. PropMix’s digitization solution uses image processing, OCR, and deep learning to process many common appraisal forms and produce MISMO XML from them. We can process PDFs containing parseable content (first-generation PDFs) or scanned images (second or higher-generation PDFs).
Built on decades of experience in image processing and Artificial Intelligence, our intelligent OCR (Optical Character Recognition) process can extract data from any document.
Appraisal Digitization Challenges and Solutions
Processing both first and second-generation appraisal documents raises several interesting challenges:
- We need to determine the beginning of the form within the PDF. For example, the first page of the 1004 form need not be the first page of the document.
- Each page of the form has a specific layout, and certain pages such as the comparable grid can repeat anywhere in the document: within the core form pages, or separately, for example as the 24th page of a 30-page document. A deep learning model classifies the types of form pages so that they can be identified anywhere in the document.
- All pages of appraisal forms contain checkboxes, and each appraisal form software package fills them in differently. In addition, because the forms use different fonts, the checkboxes appear in a variety of ways and are not picked up by OCR. We use an elaborate combination of image processing techniques to determine checkbox boundaries and to know if a box is checked or not.
- Another problem specific to appraisal forms is the use of table structures to represent information. The digitization solution needs to comprehend the different columns of data and properly demarcate them to know where certain cells of the table are empty and to perform the appropriate data type checks on the extracted data. A combination of image and signal processing techniques helps us identify table, column, and row boundaries so that each extracted data item is bucketed and mapped appropriately.
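To illustrate the column demarcation described in the last item, here is a minimal sketch of one common technique, a vertical projection profile. It is not PropMix's actual implementation: the function name `column_spans`, the `min_gap` parameter, and the input format (a binarized page region given as rows of 0/1 pixels) are all assumptions made for this example.

```python
from typing import List, Tuple


def column_spans(image: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) x-ranges of columns in a binarized image.

    Hypothetical helper: `image` is a list of pixel rows, 1 = ink, 0 = background.
    A column is a run of x positions containing ink; two columns are separated
    by at least `min_gap` consecutive x positions with no ink at all.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection profile: amount of ink at each x position.
    profile = [sum(row[x] for row in image) for x in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None  # x where the current column began, if we are inside one
    gap = 0       # number of empty x positions seen since the last ink
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The last x with ink was `gap` positions back.
                spans.append((start, x - gap))
                start, gap = None, 0
    if start is not None:
        spans.append((start, width - 1 - gap))
    return spans
```

Once the column spans are known, each extracted text fragment can be assigned to the column whose span contains its x position, and a column with no fragment in a given row can be treated as an empty cell.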
In addition to the above, second or later-generation PDFs pose more complex issues because the text in these documents is not parseable; we only have an image for each page of the form. We rely on our OCR engine to extract the text from such pages and then process the text through a combination of heuristics, statistical models, and machine learning techniques to determine fields and field values. For example, the OCR engine might extract an adjustment value as “^ $ 6000 |”. Because our processing has mapped this value to an adjustment field, we expect to see an amount and can deduce that the value must be “$6,000”. Similar rules apply to most of the fields, including certain higher-level data checks, for example on Census Tract IDs, Flood Zone indicators, and dates.
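The adjustment example can be sketched as a small field-aware cleanup step. This is only an illustration under stated assumptions: the function name `normalize_amount` and the regular expression are hypothetical, and a production pipeline would combine such rules with the statistical and machine learning techniques mentioned above.

```python
import re
from typing import Optional


def normalize_amount(raw: str) -> Optional[str]:
    """Clean an OCR string for a field that is known to hold a dollar amount.

    Hypothetical helper: finds the first run of digits (optionally preceded by
    a minus sign and a dollar sign), ignores stray OCR marks around it, and
    returns the value in a canonical "$1,234" form. Returns None if no digits
    are found, so the caller can flag the field for a quality check.
    """
    match = re.search(r"(-?)\s*\$?\s*(\d[\d,]*)", raw)
    if match is None:
        return None
    sign, digits = match.groups()
    value = int(digits.replace(",", ""))
    return f"{sign}${value:,}"


print(normalize_amount("^ $ 6000 |"))  # $6,000
```

Because the field type is known in advance, the cleanup can be strict: anything that is not part of an amount is discarded rather than passed downstream.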
Digitization at Scale
With all of the complexity explained above, extracting reliable data from an appraisal form is a computationally intensive process. Thanks to our fully scalable cloud-based platform hosted on AWS, we can easily scale to handle high volumes. The system automatically adds more servers to our compute clusters in response to increasing volume so that we can maintain response times within our committed SLA limits.
We can process most common documents within 5 minutes. Processing time can be slightly higher for large (over 30MB) second or later generation PDF documents or documents containing more than 40 pages.
In addition to handling high volumes, the digitization solution is designed from the ground up to scale functionally to handle new types of appraisal documents. The system currently supports the following:
- 1004 UAD forms for Single Family appraisals
- 1073 UAD for Condominiums
All the data extracted from any form is standardized into a common appraisal data model which is reused for all property types – SFR, Condo, 2-4 Unit Multi-Family, Manufactured Homes, etc. This allows us to easily generate any target data format from the standardized data model. The output data format out-of-the-box is MISMO 2.6 GSE, but we can also generate any other custom format as required.
Digitization Quality Control
There are two main challenges in ensuring the quality of the data produced from the appraisal documents:
- Handling the varying quality levels of second or later-generation PDF documents. If the scan quality of a document is poor, the OCR process may not be able to extract useful information.
Our unique combination of image processing, OCR, and deep learning helps us handle a wide range of document qualities.
- Ensuring the consistency of data within the appraisal form using quality checks.
We check for consistency within the extracted data using a set of rules. For example, the adjustment numbers need to be mathematically consistent, subject property data needs to be consistent between the site/improvement sections and the comparable grid, and dates need to be consistent with each other (the effective date of the appraisal vs. the signature date, for instance).
Each appraisal is assigned a data quality score after the extraction is completed. If the document does not achieve a target data quality score, it is disqualified from delivery and we report an error to the client instead. This discipline of quality control has helped improve the reliability of the PropMix digitization solution.
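The rule checks and the score gate can be sketched as follows. Everything here is an assumption made for illustration: the `Comparable` and `Appraisal` types, the specific rules (including the assumption that the effective date is not later than the signature date), the score defined as the fraction of passed checks, and the 0.9 threshold are hypothetical and do not describe how PropMix actually computes its score.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Comparable:
    sale_price: int
    adjustments: List[int]  # signed line-item adjustments from the grid
    net_adjustment: int     # net adjustment as printed on the form
    adjusted_price: int     # adjusted sale price as printed on the form


@dataclass
class Appraisal:
    effective_date: date
    signature_date: date
    comparables: List[Comparable]


def run_checks(appraisal: Appraisal) -> List[bool]:
    """Return one pass/fail result per consistency rule (hypothetical rules)."""
    # Assumed date rule: the report is signed on or after the effective date.
    results = [appraisal.effective_date <= appraisal.signature_date]
    for comp in appraisal.comparables:
        # Line-item adjustments must add up to the printed net adjustment.
        results.append(sum(comp.adjustments) == comp.net_adjustment)
        # Sale price plus net adjustment must equal the printed adjusted price.
        results.append(comp.sale_price + comp.net_adjustment == comp.adjusted_price)
    return results


def quality_score(appraisal: Appraisal) -> float:
    """Assumed score: the fraction of consistency checks that pass."""
    results = run_checks(appraisal)
    return sum(results) / len(results)


def is_deliverable(appraisal: Appraisal, target: float = 0.9) -> bool:
    """Gate delivery on a target score; the 0.9 default is a placeholder."""
    return quality_score(appraisal) >= target
```

A caller would deliver the generated output only when `is_deliverable` returns true and report an error to the client otherwise.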
Try it now https://propmix.io/appraisal-digitization