Forum: Jacob's Hideout BBS

DocLang

From Retrograde@3:633/10 to All on Tuesday, June 16, 2026 02:53:50

From the "lacking intelligence enough to parse SGML" department:
Feed: www.theregister.com - Articles
Title: A modest proposal: Reformat everything to make documents more
palatable to AI
Date: Mon, 15 Jun 2026 23:23:21 +0000
Link: https://www.theregister.com/ai-and-ml/2026/06/16/a-modest-proposal-reformat-everything-to-make-documents-more-palatable-to-ai/5255938

Image[1]

Websites are being redesigned for consumption by AI models, and now a
coalition wants to extend the trend to digital documents. The LF AI &
Data Foundation, under the Linux Foundation, has formed a working group
to steer the development of DocLang, an AI-friendly document format that
aims to help enterprises feed their files to AI systems.

The DocLang group, founded by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal,
and Forgis, contends that existing formats like PDF, Markdown, HTML, and
LaTeX are ill-suited for AI document parsing.

In late 2024, IBM developed an open source toolkit called Docling to
facilitate AI document parsing, not unlike Microsoft's MarkItDown or the
Marker project. Docling provides a way to convert various file formats
into structured AI-ready data. DocLang expands upon that foundation with
a standard for exchanging structured output across different systems.

"DocLang is designed to solve one of the foundational problems in
enterprise AI: documents were built for humans, not machines," said
Maxime Vermeir, VP of AI Strategy at AI automation biz ABBYY, in a
statement. "By introducing a minimal, standardized, and AI-native
representation of document structure, layout, meaning and governance,
DocLang creates a far more deterministic foundation for modern AI
systems."

The new DocLang format is necessary, the spec authors argue, because
existing formats were designed for rendering and lose semantic
information, structural relationships, or geometric context when AI
models turn them into tokens. The specification explains that Markdown
lacks sufficient scope, that HTML is excessively verbose, and that LaTeX
allows too much ambiguity.

Essentially, DocLang is optimized for LLM tokenizers through markup that
maps between DocLang elements and LLM tokens on a 1-to-1 basis. The spec
relies on a limited XML vocabulary that aligns with LLM tokenizers to
produce optimized prompts. It is lossless, so the AI conversion doesn't
do away with valuable info. It's designed to support common graphical
elements like tables, formulas, charts, and multimodal content. And it's
an open standard.

DocLang could also help keep costs under control. According to AI Cost
Check, having an AI model conduct an OCR scan on a PDF requires about
1,200 input tokens and 150 output tokens as a baseline. That's
inconsequential to corporate AI customers on a one-off basis but demands
attention at scale. And because AI models have highly variable token
costs, companies may find they are spending more than they anticipated
to have their AI system ingest PDFs, particularly if the documents are
long and complicated or an expensive frontier model is used.

"PDFs were designed for rendering, not understanding," said Jon Knisley,
AI Value and Enablement Lead at ABBYY, in an email to The Register.
"Every time a PDF enters an AI pipeline, structure, meaning and layout
get lost, so the model's accuracy ends up bottlenecked by document
quality rather than model quality. Teams compensate by building custom
parsers at every integration point, which results in brittle, one-off
work, and a new engineering sprint for every new document type."

According to Knisley, that has measurable cost. "Ambiguous structure
forces the model into guesswork, which drives up hallucination risk and
burns tokens deciphering layout instead of extracting meaning," he
explained. "With DocLang, customers can expect better accuracy, lower
costs, fewer tokens consumed, faster performance and more consistent
outputs. The exact savings depend on the use case and document
complexity, but our initial benchmarks show 4x to more than 30x lower
cost depending on the model evaluated."

Knisley also cited governance advantages, noting that document
provenance data and metadata can get stripped when documents get moved.
DocLang, he said, keeps that information attached.

ABBYY, which offers AI document processing, has created the DocLang
Interactive Benchmark to illustrate the potential token savings of
feeding DocLang documents to AI models. A PDF of IBM's 2025 annual
report, for example, results in 8,421 input tokens and 512 output tokens
while a DocLang version requires only 5,310 input tokens and 498 output
tokens. What's more, the DocLang version results in lower latency (2.7s
vs 4.2s) and delivers better quality (the AI missed one subsection and
mangled a table merger in the PDF).

"It's still early, and we won't overstate adoption," said Knisley. "The
standard is open and free to build on, and the group is actively
inviting more technology providers and enterprises to join. The early
response has been encouraging, and we're optimistic about where it goes
from here."
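
As a rough check on the benchmark example quoted above, the relative
savings can be worked out directly from the published numbers. This is
only a sketch: the helper name pct_reduction and the dictionary layout
are made up for illustration, and only the token counts and latencies
come from the article.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


# Figures quoted in the article for the IBM 2025 annual report example.
pdf = {"input_tokens": 8421, "output_tokens": 512, "latency_s": 4.2}
doclang = {"input_tokens": 5310, "output_tokens": 498, "latency_s": 2.7}

# Per-metric reduction, rounded to one decimal place.
savings = {key: round(pct_reduction(pdf[key], doclang[key]), 1) for key in pdf}

# Combined input + output tokens for each format.
total_pdf = pdf["input_tokens"] + pdf["output_tokens"]
total_doclang = doclang["input_tokens"] + doclang["output_tokens"]
savings["total_tokens"] = round(pct_reduction(total_pdf, total_doclang), 1)

for name, pct in savings.items():
    print(f"{name}: {pct}% lower with DocLang")
```

On these figures the input side shrinks by about 36.9% and the combined
token count by about 35.0%, with latency down roughly 35.7%. That covers
only the single document quoted; it says nothing about the 4x to 30x
cost range, which depends on per-model pricing not given here.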

Links:
[1]: https://image.theregister.com/?imageId=5255961&width=800 (image)

--- PyGate Linux v1.5.16
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Lawrence D'Oliveiro@3:633/10 to All on Tuesday, June 16, 2026 03:40:21

On 16 Jun 2026 02:53:50 GMT, Retrograde wrote:

[from <https://www.theregister.com/ai-and-ml/2026/06/16/a-modest-proposal-reformat-everything-to-make-documents-more-palatable-to-ai/5255938>:]
"DocLang is designed to solve one of the foundational problems in
enterprise AI: documents were built for humans, not machines,"

LOL at "documents were built for humans, not machines". What was the
"I" in "AI" supposed to stand for, again?

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Computer Nerd Kev@3:633/10 to All on Wednesday, June 17, 2026 08:24:53

Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On 16 Jun 2026 02:53:50 GMT, Retrograde wrote:

[from <https://www.theregister.com/ai-and-ml/2026/06/16/a-modest-proposal-reformat-everything-to-make-documents-more-palatable-to-ai/5255938>:]
"DocLang is designed to solve one of the foundational problems in
enterprise AI: documents were built for humans, not machines,"

LOL at "documents were built for humans, not machines". What was the
"I" in "AI" supposed to stand for, again?

From what I've seen, it's definitely "Idiot".

--
__ __
#_ < |\| |< _#

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Scott Dorsey@3:633/10 to All on Tuesday, June 16, 2026 19:17:43

Computer Nerd Kev <not@telling.you.invalid> wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On 16 Jun 2026 02:53:50 GMT, Retrograde wrote:

[from <https://www.theregister.com/ai-and-ml/2026/06/16/a-modest-proposal-reformat-everything-to-make-documents-more-palatable-to-ai/5255938>:]
"DocLang is designed to solve one of the foundational problems in
enterprise AI: documents were built for humans, not machines,"

LOL at "documents were built for humans, not machines". What was the
"I" in "AI" supposed to stand for, again?

From what I've seen, it's definitely "Idiot".

Why can't people just use TeX markup like God and Knuth intended?
--scott
--
"C'est un Nagra. C'est suisse, et tres, tres precis."

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Lawrence D'Oliveiro@3:633/10 to All on Wednesday, June 17, 2026 07:35:52

On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

Why can't people just use TeX markup like God and Knuth intended?

Because troff came first.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Scott Dorsey@3:633/10 to All on Wednesday, June 17, 2026 18:41:06

In article <110tion$1mg7k$2@dont-email.me>,
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

Why can't people just use TeX markup like God and Knuth intended?

Because troff came first.

troff was just an updated runoff. TeX was a different order of magnitude;
it was up with commercial typesetting systems like Xics.
--scott
--
"C'est un Nagra. C'est suisse, et tres, tres precis."

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Bob Eager@3:633/10 to All on Wednesday, June 17, 2026 22:44:53

On Wed, 17 Jun 2026 18:41:06 -0400, Scott Dorsey wrote:

In article <110tion$1mg7k$2@dont-email.me>,
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

Why can't people just use TeX markup like God and Knuth intended?

Because troff came first.

troff was just an updated runoff. TeX was a different order of
magnitude; it was up with commercial typesetting systems like Xics.
--scott

troff was a development of roff, which included some typesetting features. roff was named as a UNIX-style (shorter word) version of DEC's Runoff.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Lawrence D'Oliveiro@3:633/10 to All on Wednesday, June 17, 2026 23:54:56

On Wed, 17 Jun 2026 18:41:06 -0400 (EDT), Scott Dorsey wrote:

On Wed, 17 Jun 2026 07:35:52 -0000 (UTC), Lawrence D'Oliveiro wrote:

On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

Why can't people just use TeX markup like God and Knuth intended?

Because troff came first.

troff was just an updated runoff. TeX was a different order of
magnitude; it was up with commercial typesetting systems like Xics.

troff appears: 4th ed Unix, 1973 <https://wiki.tuhs.org/doku.php?id=systems:4th_edition&s[]=troff>.
Originally designed for the CAT phototypesetter from 1972 <https://en.wikipedia.org/wiki/Troff>.

TEX -- didn't come out till 1978, if I recall rightly.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
