DoxTek recommends several best practices to ensure the highest possible scan quality. The practices listed below can be implemented within your solution. Not all of them apply in every situation, but investing time in the scan process yields a higher level of accuracy throughout your entire document processing solution.
When scanning, we recommend the following settings:
DPI: 300 dpi
Color Configuration: Black and white
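As a quick sanity check on the resolution setting, the pixel dimensions a page should have at 300 dpi can be computed from its physical size. The following is a minimal sketch; the function names are illustrative and are not part of any DoxTek product.

```python
def expected_pixels(width_in, height_in, dpi=300):
    """Return the (width, height) in pixels of a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width, width_in):
    """Return the resolution implied by an image's pixel width and the page's physical width."""
    return pixel_width / width_in


# A letter-size page (8.5 x 11 inches) scanned at the recommended 300 dpi.
letter = expected_pixels(8.5, 11)      # (2550, 3300)

# An image only 1700 pixels wide was scanned at a lower resolution.
low_res = effective_dpi(1700, 8.5)     # 200.0
```

If an incoming image is noticeably smaller than the expected dimensions for its page size, it was probably scanned below the recommended resolution.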
However, scanning with these settings alone can still produce an image on which Optical Character Recognition (OCR) performs poorly, leading to less reliable data. This makes it more difficult to extract data from scanned images or to search them.
Consider implementing some of the following, whether through your scanner solution or manually as you import or scan your files:
Deskew
When a document is placed in or on the scanner without being flush against the edges, the scanned image comes out crooked. Deskewing rotates the image back to its proper orientation.
Example Document: Thin paper (like a receipt) that is difficult to align with scanner edges.
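One common way to estimate how crooked a page is (not necessarily the method your scanner software uses) is a projection profile: try a range of candidate angles and keep the one at which the ink pixels pile up into the sharpest horizontal rows. The sketch below assumes the black pixels are available as a list of (x, y) coordinates with y increasing downward; the function name, the search range, and the step size are illustrative choices.

```python
import math


def estimate_skew_degrees(ink, max_angle=5.0, step=0.5):
    """Estimate the skew angle of text lines from black-pixel coordinates.

    ink: list of (x, y) tuples for black pixels, with y increasing downward.
    Returns the angle (in degrees) at which text lines slope, where a positive
    value means the lines descend toward the right.
    """
    best_angle, best_score = 0.0, -1.0
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        histogram = {}
        for x, y in ink:
            # Row the pixel would land on if the page were rotated back by `angle`.
            row = round(y * cos_t - x * sin_t)
            histogram[row] = histogram.get(row, 0) + 1
        # Well-aligned text concentrates ink in few rows, which maximizes this sum.
        score = sum(count * count for count in histogram.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic example: three text lines drawn with a 2 degree slope.
slope = math.tan(math.radians(2.0))
ink = [(x, round(y0 + x * slope)) for y0 in (20, 40, 60) for x in range(200)]
estimated = estimate_skew_degrees(ink)  # 2.0 for this synthetic input
```

Rotating the image by the negative of the estimated angle straightens it. A smaller step gives a finer estimate at the cost of more computation.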
Especially with smaller documents, the scanned image can include a black background that is irrelevant and should be removed so that it does not distract from the content of the actual document.
Example Document: Small form that is not the typical 8.5 x 11 inch page size
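Removing the black background amounts to cropping the image to the area the paper actually occupies. The sketch below assumes a grayscale image stored as a list of rows with values from 0 (black) to 255 (white); the function name and the brightness threshold are illustrative assumptions.

```python
def crop_black_border(image, threshold=50):
    """Crop a grayscale image to the bounding box of pixels brighter than threshold.

    image: non-empty list of equal-length rows with values 0 (black) to 255 (white).
    """
    rows = [y for y, row in enumerate(image) if any(p > threshold for p in row)]
    cols = [x for x in range(len(image[0])) if any(row[x] > threshold for row in image)]
    if not rows or not cols:
        return image  # Only background was found, so leave the image unchanged.
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# A small white form (with one black text pixel) on a black scanner background.
scan = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 255, 255, 255, 255, 0, 0],
    [0, 0, 255, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 255, 255, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
cropped = crop_black_border(scan)  # 3 rows by 4 columns
```

Because the crop uses the bounding box of bright pixels, dark text inside the paper area is preserved.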
Some documents contain shading (especially with tables) to assist the human eye in distinguishing between different rows. However, this shading leads to decreased image quality once scanned and should be removed.
Example Document: Invoice with a numerical table
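A simple way to illustrate shade removal is a fixed brightness threshold: light gray shading is pushed to white while dark text stays black. This is only a sketch under that assumption; the function name and the threshold of 128 are illustrative, and real scanner software may use more adaptive techniques.

```python
def remove_shading(image, threshold=128):
    """Binarize a grayscale image: pixels darker than threshold become black (0), the rest white (255)."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in image]


# A shaded table row (gray value 200) containing dark text pixels (value 30).
shaded_row = [
    [200, 200, 30, 200],
    [200, 30, 30, 200],
]
clean_row = remove_shading(shaded_row)
```

A threshold that is too high turns dark shading into solid black, and one that is too low erases faint text, so the value is worth testing against your own documents.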
An image may contain speckles, dots, or other small marks that can detract from the actual data of the document.
Example Document: A document that has accidental speckles
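Despeckling can be sketched as finding connected groups of ink pixels and erasing the groups that are too small to be part of a character. The example below assumes a black-and-white image stored as rows of 1 (ink) and 0 (background); the function name and the min_size parameter are illustrative.

```python
from collections import deque


def despeckle(image, min_size=3):
    """Erase connected groups of ink pixels that contain fewer than min_size pixels.

    image: non-empty list of equal-length rows containing 1 (ink) or 0 (background).
    Pixels touching horizontally, vertically, or diagonally belong to the same group.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search to collect every pixel in this group.
            component = []
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                component.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if len(component) < min_size:
                for cx, cy in component:
                    result[cy][cx] = 0
    return result


# One isolated speckle (top left) next to a four-pixel stroke.
page = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = despeckle(page)
```

Setting min_size too high can also erase legitimate small marks such as periods or the dots on an ‘i’, so the value should be tested against representative documents.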
Oftentimes, the print on a document is not sufficiently crisp, making it hard for a computer to differentiate between individual characters. For example, OCR could easily confuse an ‘n’ with an ‘m’ when there is no clear start and stop between characters.
Example Document: Older document or document that was printed with low amounts of ink
Horizontal Line Removal
When scanning documents with tables on them, it is often helpful to remove unnecessary lines that are traditionally used for human readability. See the section for Vertical Line Removal below.
Example Document: Invoice
Vertical Line Removal
When scanning documents with tables on them, it is often helpful to remove unnecessary lines that are traditionally used for human readability. See the section for Horizontal Line Removal above.
Example Document: Invoice
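Both kinds of line removal can be sketched the same way: find unbroken runs of ink that are longer than any character stroke and erase them. The example below assumes a black-and-white image stored as rows of 1 (ink) and 0 (background); the function names and the min_length parameter are illustrative. Vertical removal reuses the horizontal routine on the transposed image.

```python
def remove_horizontal_lines(image, min_length):
    """Erase horizontal runs of ink that are at least min_length pixels long."""
    result = [row[:] for row in image]
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x] != 1:
                x += 1
                continue
            start = x
            while x < len(row) and row[x] == 1:
                x += 1
            if x - start >= min_length:
                for i in range(start, x):
                    result[y][i] = 0
    return result


def remove_vertical_lines(image, min_length):
    """Erase vertical runs of ink by transposing, removing horizontal runs, and transposing back."""
    transposed = [list(column) for column in zip(*image)]
    cleaned = remove_horizontal_lines(transposed, min_length)
    return [list(row) for row in zip(*cleaned)]


# A table border along the top row and down the second column, plus short text strokes.
table = [
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
no_horizontal = remove_horizontal_lines(table, min_length=4)
no_vertical = remove_vertical_lines(table, min_length=4)
```

The min_length value should be longer than the widest or tallest character stroke; otherwise parts of letters are erased along with the table lines. Where a line crosses text, the overlapping text pixels are erased as well.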
Stray lines or streaks within a document may be mistaken for part of a character. For example, a lowercase ‘l’ could be confused with a ‘T’ if a perpendicular streak is near the top of the ‘l’.
Example Document: Poorly scanned document or a document that had some debris (dirt, hair, etc.) on it at scan time.
Watermarks & Logos
It is common for certain documents, such as university transcripts, to have logos or watermarks on them. Ensure that these marks are removed so that they do not interfere with the data of the document. Although a grayed-out logo or watermark may not interfere with a human’s ability to read the document, it can affect how the document’s data is extracted during compression.
Example Document: The draft version of a legal document
Keep in mind that each of these settings has its advantages and disadvantages. Test the different settings and see the impact they have on the document being processed by comparing the anticipated results with the extracted results. If you have any further questions regarding best practices, please reach out to your first line of support for additional information.
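The comparison of anticipated and extracted results described above can be sketched as a simple field-level accuracy score. The function name, field names, and values below are hypothetical and serve only to illustrate the idea.

```python
def field_accuracy(expected, extracted):
    """Return the fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 1.0
    matches = sum(1 for name, value in expected.items() if extracted.get(name) == value)
    return matches / len(expected)


# Hypothetical anticipated values for one invoice.
expected = {"invoice_number": "INV-1001", "date": "2023-01-15", "total": "250.00"}

# Hypothetical extraction results under two different scan settings.
without_cleanup = {"invoice_number": "INV-1001", "date": "2023-01-15", "total": "25O.00"}
with_cleanup = {"invoice_number": "INV-1001", "date": "2023-01-15", "total": "250.00"}

baseline_score = field_accuracy(expected, without_cleanup)  # 2 of 3 fields match
cleanup_score = field_accuracy(expected, with_cleanup)      # all 3 fields match
```

Running the same set of sample documents through each combination of settings and comparing the scores shows which settings help and which hurt for your document types.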