Building a Better News Engine: The 7 Categories We Need
Open a search engine after a big story. Ten links come back, and they look like ten different perspectives. They are usually the same story, ten times over, and we want to help you fix this.
Unbubble Hub is an Open Research Initiative that provides a space for researchers and engineers to come together and collaborate in developing tools to fight social polarization.
Sources is an open-source project, hosted on GitHub, that takes a news event and returns sources, categorized and ranked, representing a range of diverse viewpoints.
Giorgio Catalani (find him on LinkedIn) started his career as a journalist and slowly moved closer to technology. He works as a senior product manager for a major publishing group; every day, he tries to make life easier for editors and journalists. Big internet nerd, movie buff, and wannabe cook. This is his first article for Unbubble Hub.
Suppose the European Central Bank (ECB) announced this week that it was holding interest rates steady, a decision met, as such decisions always are, with disagreement about whether it was too cautious, too late, or exactly right.
Suddenly, the news is everywhere. A reader, wanting to understand what is going on and seeking the broadest possible perspective, turns to a news aggregator, a search engine, or an AI assistant. Ten results come back. They arrive from ten different outlets, across four political leanings, in three languages.
If those ten results are wire reports (e.g., Reuters, ANSA, AP), they will be strikingly homogeneous in stance: the rate, the vote, the statement, the market reaction, and a quote from Christine Lagarde. As a set, they may do a reasonable job of covering the key verifiable facts, with only subtle differences in what gets foregrounded. If, instead, the ten results are op-eds, they will look gloriously diverse in stance: “the ECB is choking the recovery,” “the ECB is fighting the last war,” “the ECB is right to wait.” Yet they will be redundant on fact, because op-ed writers all argue from the same narrow set of available data.
Photo by Evie S. on Unsplash
The News Ecosystem
A breaking news item and an opinion column both shape public perception, but through fundamentally different mechanisms. A breaking news item shapes perception by selecting which facts to present and which to omit; an opinion column shapes it by arguing what those facts mean, who is favored by the decision, and what the repercussions might be.
The worst disservice we can do to ourselves as readers is to compare these outputs as equals. It is fairly easy to label them as “subjective vs. objective” or “biased vs. unbiased,” but that misses the point: they are completely different products engineered for entirely different needs.
Journalism studies has known this for decades. Work on Western newspapers tracing back to the 1960s highlights this distinction across national systems, showing that the news item, the interpretive report, and the commentary are distinct genres governed by different norms and yielding different effects.[1]
What is strange is that almost none of this nuance has crossed over into the systems currently deciding what gets read. News aggregators, search engines, LLMs, and “compare coverage” features treat an article as an article, and a link as a link. They adjust for source reputation, recency, and language, but the actual genre of the article remains invisible to them. A reader looking for the broadest possible perspective on a fact is gambling, hoping an AI will solve a problem it was fundamentally not built to solve.
This is the exact problem we are trying to solve with Sources.
What Publishers Already Know
Publishers already encode these distinctions into their infrastructure. They assign different genres to different desks and templates. They mark bylines differently: a political editor writes analysis, a columnist writes opinion, a fact-checking unit handles verifications. Most importantly, this architecture is visible in the metadata, often right in the URL: /news/, /analysis/, /opinion/, /fact-check/, /guide/.
Publishers do not do this to help algorithms; they do it because they are speaking to two different audiences simultaneously:
The Reader: scans titles and bylines to gauge intent.
The Platform (Google, Apple News, AI search): reads URLs and structured metadata.
A URL containing /opinion/ is a direct message sent to the algorithm: This is an argument, not a report. Weight it accordingly. A URL with /fact-check/ signals: This is a verification, not a claim.
These signals are practically free. They are already produced by the publisher and baked into the page source. Yet, the machines deciding what to show you mostly ignore them.
There is, however, a real complication. Research shows that journalistic cultures vary wildly. Italian newspapers, for example, blend opinion into news at significantly higher rates than their British or German counterparts, while Spanish and French outlets sit somewhere in the middle.[2] This is not a bug to design around; it is a feature of different journalistic norms regarding where the line between reporting and interpretation sits. Any system attempting to classify articles by URL and title will naturally work better in some languages than others. This requires careful calibration, not assumption. But it is still a better starting point than pretending the signal doesn’t exist.
A Seven-Type Framework to Define the Ecosystem
To operationalize this, I am proposing a first-pass taxonomy: seven article types, separated by their core intent, and identifiable (albeit imperfectly) by URL and metadata signals.[3]
To make this concrete, let’s watch our hypothetical week of ECB coverage refract through each type:
1. The Breaking News Item: A concise, fact-focused report produced under time pressure. The lead answers who, what, where, and when. Sources are cited as authorities, not interlocutors. No thesis is advanced.
Example: “ECB holds rates at 3.5%, cites persistent core inflation.”
Signals: /news/, /world/, /economy/; wire origin; staff/newsroom byline.
2. The News Analysis: Goes beyond current facts to speculate on significance, outcomes, and motives. It remains journalism, making empirical claims rather than normative judgments. It answers why at length.
Example: “Why Lagarde signalled caution despite cooling headline inflation.”
Signals: /analysis/, /news-analysis/; senior correspondent or bureau chief byline.
3. The Opinion or Editorial: Explicitly exercises normative or evaluative judgment. The author takes a position and argues for it, often using first-person and evaluative language (“should”, “must”, “dangerous”). The purpose is to persuade.
Example: “The ECB is choking the recovery—and Italy will pay first.”
Signals: /opinion/, /commentary/, /editorial/, /op-ed/.
4. The Explainer: Pedagogical content that provides historical, institutional, or structural context. It answers “what is…” or “how does X work.” Often evergreen.
Example: “How does the ECB actually decide where to set rates?”
Signals: /explainer/, /guide/, /background/; didactic formatting (FAQs, lists).
5. The Interview or Testimony: Dedicates extended space to a single voice (e.g., Q&A, profile, long monologue). The subject is the focal point of the piece.
Example: A Q&A with a former central banker on the rate decision.
Signals: /interviews/, /voices/, /profiles/, /long-reads/.
6. The Fact-Check: Evaluates the truth of a specific public claim. Structured logically: claim → evidence → verdict.
Example: “Did inflation really drop as much as the government is claiming?”
Signals: /fact-check/, /verifica/; dedicated fact-checking outlets (Full Fact, Reuters Fact Check).
7. The Wire Republication: Reproduces a wire service report with minimal to no editing. Identical across multiple outlets.
Example: The exact same Reuters story run verbatim across six Italian dailies.
Signals: Explicit wire credit; identical headlines across properties. (Note: This category is the most fluid; it may function better as a “provenance tag” applied to Type 1 rather than a standalone genre).
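The signal lists above collapse naturally into a first-pass classifier. The sketch below is ours, not the actual Sources implementation, and the token lists are the illustrative ones from this article. Note that match order matters: a /fact-check/ URL often also lives under /news/, so the most specific signals are checked first, and an explicit wire credit overrides everything (Type 7 as a provenance tag):

```python
from urllib.parse import urlparse

# Most specific signals first, so /fact-check/ is not swallowed
# by a generic /news/ match further down the list.
TYPE_SIGNALS = [
    ("fact-check", {"fact-check", "verifica"}),
    ("opinion", {"opinion", "commentary", "editorial", "op-ed"}),
    ("analysis", {"analysis", "news-analysis"}),
    ("explainer", {"explainer", "guide", "background"}),
    ("interview", {"interviews", "voices", "profiles", "long-reads"}),
    ("news", {"news", "world", "economy"}),
]

def classify(url: str, wire_credit: bool = False) -> str:
    """First-pass article type from URL path segments.

    A detected wire credit acts as a provenance override (Type 7).
    """
    if wire_credit:
        return "wire-republication"
    segments = set(urlparse(url).path.lower().strip("/").split("/"))
    for label, tokens in TYPE_SIGNALS:
        if segments & tokens:
            return label
    return "unclassified"
```

The "unclassified" fallback matters: given how much cross-cultural variation the research describes, a system should be honest about the URLs it cannot read rather than force them into a type.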
What is Next for “Sources”?
Our goal is to operationalize the concept of “perspective”: to define it with enough precision that a system can act on it. If our mission is to provide the broadest possible perspective on a given fact, this framework might play a substantial role.
That being said, there are some open questions that we still want to address:
Does the framework classify accurately enough to use? We will run the taxonomy against current Sources output for a representative news event and compare the distribution against hand-labeled ground truth. The ultimate stress test will be the Italian media ecosystem, where the blend of reportage and interpretation will require heavy calibration.
Can the framework define a useful subset of sources? Our current ranker selects the top ten sources based on relevance and five existing diversity dimensions. If we add genre type as a pre-filter, does it produce a subset that is materially better? Specifically: is it better balanced between fact-supply and argument-supply?
Does this actually enhance perspective? Adding type might make the system’s output measurably more diverse, or it might just add an invisible layer of structure that readers ignore.
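As a sketch of the pre-filter idea in the second question, assume candidates arrive from the existing ranker as (relevance score, genre) pairs, already sorted by relevance. The quota numbers and the fact-supply/argument-supply split below are our assumptions for illustration, not the project's actual ranking policy:

```python
# Genres treated here as "fact-supply"; everything else (opinion,
# interview) counts as "argument-supply". An illustrative split.
FACT_SUPPLY = {"news", "analysis", "fact-check", "explainer", "wire-republication"}

def balanced_top(candidates, k=10, min_fact=4, min_argument=3):
    """Greedy selection that reserves slots for both supply types.

    `candidates` is a list of (score, genre) tuples sorted by
    descending relevance; quotas are hypothetical defaults.
    """
    facts = [c for c in candidates if c[1] in FACT_SUPPLY]
    args = [c for c in candidates if c[1] not in FACT_SUPPLY]
    picked = facts[:min_fact] + args[:min_argument]
    # Fill the remaining slots purely by relevance.
    rest = [c for c in candidates if c not in picked]
    picked += rest[: k - len(picked)]
    return sorted(picked, key=lambda c: -c[0])[:k]
```

Even this crude quota would prevent the two failure modes from the ECB example: ten homogeneous wire reports, or ten fact-redundant op-eds.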
Ultimately, we believe that you cannot fix the blindness of algorithmic retrieval just by feeding the machine more links. You have to teach the machine what kind of text it is looking at. We don’t have all the answers yet, and this first version is a starting point, not a solution. We’ll publish what we learn here as it comes, including the parts that break.
1. Frank Esser and Andrea Umbricht: refers to their comparative research, notably “The Evolution of Objective and Interpretative Journalism in the Western Press” (2014), which analyzed print journalism trends across Western nations from the 1960s onward, charting the shift from pure objectivity to interpretative reporting.
2. Journalistic cultures: Esser and Umbricht’s work highlights the “polarized pluralist” model of Southern Europe (including Italy), where journalism is historically more partisan, commentary-driven, and literary, compared to the “democratic corporatist” or “liberal” models of Northern Europe and the Anglosphere.
3. Carsten Reinemann, Susana Salgado, and Jesper Strömbäck: Reinemann is recognized for foundational work distinguishing “hard” from “soft” news and for political journalism frameworks. Salgado and Strömbäck’s operational definitions (“Interpretative Journalism: A Review of Concepts, Operationalizations and Key Findings,” 2011) are vital here, as they established the measurable boundaries between fact-reporting and journalistic interpretation: the exact hinge between Type 1 (News) and Type 2 (Analysis).


