John Warnock, the PDF boss

Co-founder of Adobe Systems, John Warnock imagined the PDF to simplify the reading and exchange of documents. His invention has become one of the world’s most widely used computer formats.

PDF, for Portable Document Format, is a digital file format developed by the American computer company Adobe Systems.

Its main feature is to preserve the layout of a source document: regardless of the device, operating system and software used to create, display or print a PDF, the fonts, images and other elements of the document are rendered as in the original. As Bob Wulff, Vice President of Engineering at Adobe, points out, “The PDF format enables the user to see exactly – down to the pixel – what the author of a file has planned.”

Thanks to this property, PDF is an efficient and reliable means of exchanging and consulting electronic documents, one that meets both the needs of the general public and the requirements of institutions and companies.

The format developed by Adobe is based on three technologies: a variation of the PostScript page description language, designed by the same company, is used to generate the page layout and graphic elements; a font integration and replacement system allows fonts to “follow” the document; a structured storage method enables all this data and associated content (drawings, photos, multimedia content, etc.) to be grouped and stored in a single file. As such, PDF is “presentation-oriented”, unlike HTML or XML files.
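
As a rough illustration of the structured storage described above, the following Python sketch assembles a one-page PDF by hand: a header, numbered objects, a cross-reference table and a trailer. The function name `build_minimal_pdf` and the sample text are assumptions made for this example. It is not Adobe's tooling, and real files add compression, embedded fonts and much more. The page content uses PostScript-like operators, and the font is referenced by name (Helvetica, one of the standard fonts) rather than embedded.

```python
def build_minimal_pdf(text: str = "Hello, PDF") -> bytes:
    """Assemble a one-page PDF: header, objects, xref table, trailer."""
    # Escape characters that are special inside a PDF literal string.
    safe = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: operators that select a font, move, and draw text.
    content = f"BT /F1 24 Tf 72 720 Td ({safe}) Tj ET".encode("ascii")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length " + str(len(content)).encode("ascii") + b" >>\n"
        b"stream\n" + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object in the file
        out += f"{number} 0 obj\n".encode("ascii") + body + b"\nendobj\n"

    # Cross-reference table: one fixed-width 20-byte entry per object.
    xref_start = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")

    # Trailer: points to the catalog and to the start of the xref table.
    out += (
        f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
        f"startxref\n{xref_start}\n%%EOF\n"
    ).encode("ascii")
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf())
```

Everything the viewer needs (page tree, content, font reference) sits in a single file, and the cross-reference table lets a reader jump straight to any object by its byte offset.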

From Camelot to Acrobat

The PDF format is the brainchild of John Warnock, co-founder of Adobe Systems alongside Charles Geschke. In the early 1990s, each operating system (Mac, Windows, MS-DOS, Unix) had its own way of operating and interpreting electronic files. It was impossible “to exchange information between machines, systems and users in such a way as to guarantee that the file looks the same wherever it goes,” says Leonard Rosenthol, PDF architect at Adobe.

In August 1990, Warnock laid the foundations for the PDF in an article entitled “The Camelot Project”, in which he explained his vision:

“What industries desperately need is a universal means of communicating documents across a wide variety of machine configurations, operating systems and communications networks. These documents need to be viewable on any screen and printable on any modern printer. If this problem can be solved, then the way people work will change fundamentally.”

The work of Adobe’s project team came to fruition a year later, and the invention was announced at the Seybold Conference on Computer Publishing, held in San José in October 1991.

Originally, the PDF software was codenamed “Carousel” (hence the existence of .caro files). But the trademark had already been registered by the Eastman Kodak Company, which used it for a slide projector. It was finally under the name “Adobe Acrobat” that the product was presented at the Comdex computer show in autumn 1992, where it won the Best of Comdex Award.

Adobe Acrobat version 1.0 was officially launched on June 15, 1993. The marketing campaign, which included an eight-page ad in the Wall Street Journal, was aimed primarily at businesses and emphasized the paper savings achievable with PDF technology.

A video presenting Adobe Acrobat in 1993: The Office before its time.

Failure to take off

The PDF was not an immediate success, far from it. “When Acrobat was announced, the world didn’t get it. People didn’t understand how important sending documents electronically was going to become,” John Warnock recalled in an interview with the university journal Knowledge@Wharton.

At the time, it competed with other formats with similar ambitions, such as DjVu, Envoy, Common Ground Digital Paper and Farallon Replica. Even PostScript, conceived by Warnock and Geschke when Adobe was founded in 1982, overshadowed the newcomer.

In its first version, PDF also suffered from technical weaknesses: it supported only the RGB color model, which ruled out professional use in prepress, and its files were larger than simple text files, which meant considerably longer download times.

Above all, Acrobat Reader 1.0, essential for displaying PDF files, was not cheap: it cost between $35 and $50. The IRS did purchase a license to distribute the software to its employees, but the pricing policy condemned the PDF to limited distribution.

“The [Adobe] board wanted to bury it,” admitted Warnock. “I said, ‘No way. This is about solving an important problem, and we’re going to hang on until it works.’”

Adobe then took a radical decision to save its format: starting with version 2.0, launched in September 1994, Acrobat Reader became free. Only the software for creating and editing PDFs remained paid.

Eight versions in thirteen years

Over the next decade, Adobe set about perfecting its invention and adapting it to technological developments. With each new version, PDF was enriched with new features.

PDF version	Launch date	Main new features
1.0 (Acrobat 1)	1993	– Text – Images – Pages – Hyperlinks – Bookmarks – Thumbnails
1.1 (Acrobat 2)	1994	– Password protection – Articles – Comments – External links – Output device-independent colors – Binary format for lighter files
1.2 (Acrobat 3)	1996	– Forms – Interactive elements (radio buttons, checkboxes) – Video and sound – Chinese, Korean and Japanese language support – CMYK color space management – Plug-in for opening PDFs with a web browser
1.3 (Acrobat 4)	1999	– Annotations – Digital signatures – Spot colors – Web capture (conversion of HTML pages to PDF) – JavaScript support
1.4 (Acrobat 5)	2001	– Transparency and overprint management – 128-bit encryption – Collaborative working
1.5 (Acrobat 6)	2003	– Layer support (multilayer files) – Compression enhancement
1.6 (Acrobat 7)	2005	– 3D data integration – PDF batches (PDFs containing several individual files)
1.7 (Acrobat 8)	2006	– Enabling forms in Adobe Reader – Improved commenting, encryption and 3D animations – Predefined print parameters (paper, number of copies, print scale, etc.)

The road to standardization

By 2007, PDF had become a de facto standard. Although it was already an open format, with public specifications that anyone could implement freely, Adobe wanted to go further and announced its intention to have it standardized.

The ISO 32000-1 standard, which incorporates PDF version 1.7, was published on July 1, 2008. Since then, the official evolution of the format has depended on the technical committee of the International Organization for Standardization, of which Adobe is only a member.

In July 2017, PDF 2.0 saw the light of day, under the name ISO 32000-2. This update introduces technical improvements (encryption, annotations, accessibility, 3D…), removes obsolete elements (XFA forms, multimedia content…) and eliminates all proprietary technology from the specifications. A revision of PDF 2.0 in December 2020 clarifies, specifies and updates the standard.

PDF subformats are also standardized by ISO, each corresponding to specific needs: PDF/A (“Archive”) for archiving and long-term preservation of digital documents, PDF/X (“eXchange”) for printing and graphics production, PDF/VT (“Variable and Transactional”) for high-volume personalized printing, PDF/E (“Engineering”) for engineering documents and PDF/UA (“Universal Access”) for accessibility.

Indispensable OCR

Today, PDF is widely used for both professional and personal purposes. In keeping with its original vocation, it facilitates the human reading of documents, freeing us from the constraints of the medium. But while the format has become ubiquitous, it also has a crucial shortcoming in terms of today’s computing ambitions: it’s particularly difficult for a machine to read. To automate PDF processing, artificial intelligence is essential.

“True PDFs” are those created digitally using software; they include a text layer from which elements can be extracted. What can be extracted, however, depends on the type of PDF, and such files sometimes carry color management instructions or embedded fonts, all of which can interfere with automation.

Scanned paper documents and images saved as PDFs are “image” PDFs, requiring Optical Character Recognition (OCR) to extract their content.

A PDF that has undergone an OCR process becomes “searchable”: it has two layers, the first with the image, the second with the text, which can now be easily manipulated. Once the file has been structured in this way, the machine will have no trouble processing it.
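
The distinction between “true” and “image” PDFs can be approximated in code. The Python sketch below flags pages whose text layer is missing or nearly empty, and which would therefore need OCR. The function names, the 20-character threshold and the sample strings are assumptions made for illustration. `extract_page_texts` relies on the third-party pypdf library (its `PdfReader` class and `extract_text` method), which is assumed to be installed; the heuristic itself has no dependency.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose text layer is missing or nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars  # threshold is an assumed heuristic
    ]


def extract_page_texts(path):
    """Read the text layer of each page with pypdf (third-party, assumed installed)."""
    from pypdf import PdfReader  # imported lazily so the heuristic above stays dependency-free

    reader = PdfReader(path)
    return [page.extract_text() for page in reader.pages]


if __name__ == "__main__":
    # Hypothetical extraction results: page 2 has no text layer, as in a scan.
    sample = [
        "Invoice 2024-001\nTotal due: 120.00 EUR",
        "",
        "Terms and conditions apply to all orders.",
    ]
    print(pages_needing_ocr(sample))  # prints [2]
```

A page flagged this way would be sent through an OCR engine to add the text layer, while the other pages can be processed directly.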

Sources: Adobe, PDF Association, Ernie Smith / Tedium, Wharton / University of Pennsylvania, ISO, “Intelligent Document Processing – Methods and Tools in the real world” by Graham A. Cutting and Anne-Françoise Cutting-Decelle.

Photo credits: Marvalous via Wikimedia Commons