Reliable and intuitive text annotation
textada combines established functionality with an AI assistant to save you time and money – no programming required. Our assistant suggests high-quality annotations because it learns from the best – you and your colleagues. Your requirements are our top priority: you decide which suggestions to use.
The most efficient way to annotate
textada is easy-to-use software that helps you get started right away on annotation projects of any size and difficulty. Developed by and for researchers, textada is powerful yet user-friendly, allowing you to create data of the highest quality.
We are an interdisciplinary team with backgrounds in computer science, data science, sociology, and political science. Years of research (more than 470 citations and multiple conference awards) in these fields have made us experts in automated text analysis, manual content analysis, and the integration of techniques and knowledge across these scientific disciplines.
Our goal is to make your work easier by building an annotation tool that enables users to enjoy the latest AI technologies while not requiring any programming skills. We strive to lower your annotation effort and cost by reducing manual work as much as possible while relying on your expertise in the assisted annotation.

Felix is an expert in AI technology who has contributed to state-of-the-art AI solutions both in academia and industry, such as IBM Watson. After research stays in Silicon Valley and Tokyo, he focused his doctoral research on automated text analysis for media bias identification through the interdisciplinary integration of techniques and expertise.

Moritz has more than nine years of experience in developing web-based systems, UI design, and UX development. During his studies, he delved into natural language processing and collaborative software development, which he continues to apply in several open-source projects.
Our publications
Abstract
Extensive research on target-dependent sentiment classification (TSC) has led to strong classification performances in domains where authors tend to explicitly express sentiment about specific entities or topics, such as in reviews or on social media. We investigate TSC in news articles, a much less researched domain, despite the importance of news as an essential information source in individual and societal decision making. This article introduces NewsTSC, a manually annotated dataset to explore TSC on news articles. Investigating characteristics of sentiment in news and contrasting them to popular TSC domains, we find that sentiment in the news is expressed less explicitly, is more dependent on context and readership, and requires a greater degree of interpretation. In an extensive evaluation, we find that the current state-of-the-art in TSC performs worse on news articles than on other domains (average recall AvgRec = 69.8 on NewsTSC compared to AvgRec = [75.6, 82.2] on established TSC datasets). Reasons include incorrectly resolved relations between targets and sentiment-bearing phrases and off-context dependence. As a major improvement over previous news TSC, we find that BERT’s natural language understanding capabilities capture the less explicit sentiment used in news articles.
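To illustrate the kind of target-dependent setup this abstract refers to, the following minimal sketch encodes a sentence-target pair for a BERT-style classifier. The model name is a generic placeholder with an untrained classification head, not the news-adapted model from the paper; the pair encoding shown is one common input format for TSC, assumed here for illustration only.

```python
# Minimal sketch of a sentence-target pair encoding for target-dependent
# sentiment classification. "bert-base-uncased" is a generic placeholder
# with a randomly initialized classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)   # negative / neutral / positive

sentence = "The senator criticized the new policy during the hearing."
target = "the senator"

# Encoded as "[CLS] sentence [SEP] target [SEP]" so the model reads the
# full sentence while knowing which entity the sentiment refers to.
inputs = tokenizer(sentence, target, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))            # meaningful only after fine-tuning
```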
Abstract
Media bias and its extreme form, fake news, can decisively affect public opinion. Especially when reporting on policy issues, slanted news coverage may strongly influence societal decisions, e.g., in democratic elections. Our paper makes three contributions to address this issue. First, we present a system for bias identification, which combines state-of-the-art methods from natural language understanding. Second, we devise bias-sensitive visualizations to communicate bias in news articles to non-expert news consumers. Third, our main contribution is a large-scale user study that measures bias-awareness in a setting that approximates daily news consumption, e.g., we present respondents with a news overview and individual articles. We not only measure the visualizations’ effect on respondents’ bias-awareness, but we can also pinpoint the effects on individual components of the visualizations by employing a conjoint design. Our bias-sensitive overviews strongly and significantly increase bias-awareness in respondents. Our study further suggests that our content-driven identification method detects groups of similarly slanted news articles due to substantial biases present in individual news articles. In contrast, the reviewed prior work rather only facilitates the visibility of biases, e.g., by distinguishing left- and right-wing outlets.
Abstract
Large amounts of annotated data have become more important than ever, especially since the rise of deep learning techniques. However, manual annotations are costly. We propose a tool that enables researchers to create large, high-quality, annotated datasets with only a few manual annotations, thus strongly reducing annotation cost and effort. For this purpose, we combine an active learning (AL) approach with a pretrained language model to semi-automatically identify annotation categories in the given text documents. To highlight our research direction’s potential, we evaluate the approach on the task of identifying frames in news articles. Our preliminary results show that employing AL strongly reduces the number of annotations for correct classification of even these complex and subtle frames. On the framing dataset, the AL approach needs only 16.3% of the annotations to reach the same performance as a model trained on the full dataset.
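The annotation-saving loop described in this abstract can be sketched with uncertainty-based active learning. Below, a TF-IDF plus logistic-regression pipeline stands in for the pretrained language model, and the texts and labels are placeholders rather than the framing dataset; the sketch only illustrates the query strategy.

```python
# Minimal uncertainty-sampling active-learning loop (sketch).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["article one ...", "article two ...", "article three ...", "article four ..."]
labels = np.array([0, 1, 0, 1])           # oracle labels, normally unknown upfront

labeled = [0, 1]                          # start with a few manual annotations
unlabeled = [i for i in range(len(texts)) if i not in labeled]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

for _ in range(2):                        # a few AL rounds
    model.fit([texts[i] for i in labeled], labels[labeled])
    probs = model.predict_proba([texts[i] for i in unlabeled])
    uncertainty = 1.0 - probs.max(axis=1) # least-confident sampling
    query = unlabeled[int(uncertainty.argmax())]
    labeled.append(query)                 # the annotator labels the queried text
    unlabeled.remove(query)
```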
Abstract
Slanted news coverage, also called media bias, can heavily influence how news consumers interpret and react to the news. To automatically identify biased language, we present an exploratory approach that compares the context of related words. We train two word embedding models, one on texts of left-wing, the other on right-wing news outlets. Our hypothesis is that a word’s representations in both word embedding spaces are more similar for non-biased words than biased words. The underlying idea is that the context of biased words in different news outlets varies more strongly than the one of non-biased words, since the perception of a word as being biased differs depending on its context. While we do not find statistical significance to accept the hypothesis, the results show the effectiveness of the approach. For example, after a linear mapping of both word embeddings spaces, 31% of the words with the largest distances potentially induce bias. To improve the results, we find that the dataset needs to be significantly larger, and we derive further methodology as future research direction. To our knowledge, this paper presents the first in-depth look at the context of bias words measured by word embeddings.
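The comparison of word contexts across outlets can be sketched as follows: two word2vec models are trained on toy stand-ins for left- and right-leaning corpora, one space is linearly mapped onto the other via orthogonal Procrustes, and per-word cosine distances are inspected. The corpora are placeholders and the gensim 4 API is assumed.

```python
# Sketch: compare a word's context across two embedding spaces.
import numpy as np
from gensim.models import Word2Vec
from scipy.linalg import orthogonal_procrustes

left_corpus  = [["migrants", "seek", "asylum"], ["reform", "helps", "migrants"]]
right_corpus = [["migrants", "cross", "border"], ["border", "security", "tightened"]]

m_left  = Word2Vec(left_corpus,  vector_size=50, min_count=1, seed=1)
m_right = Word2Vec(right_corpus, vector_size=50, min_count=1, seed=1)

shared = [w for w in m_left.wv.index_to_key if w in m_right.wv.key_to_index]
A = np.stack([m_left.wv[w]  for w in shared])
B = np.stack([m_right.wv[w] for w in shared])

R, _ = orthogonal_procrustes(A, B)        # linear map from left to right space
A_mapped = A @ R

# Cosine distance per shared word: larger distance = more divergent context.
cos = (A_mapped * B).sum(1) / (np.linalg.norm(A_mapped, axis=1) * np.linalg.norm(B, axis=1))
for word, c in sorted(zip(shared, cos), key=lambda t: t[1]):
    print(word, round(1 - c, 3))
```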
Abstract
Media coverage has a substantial effect on the public perception of events. The way media frames events can significantly alter the beliefs and perceptions of our society. Nevertheless, nearly all media outlets are known to report news in a biased way. While such bias can be introduced by altering the word choice or omitting information, the perception of bias also varies largely depending on a reader’s personal background. Therefore, media bias is a very complex construct to identify and analyze. Even though media bias has been the subject of many studies, previous assessment strategies are oversimplified and lack overlap and empirical evaluation. Thus, this study aims to develop a scale that can be used as a reliable standard to evaluate article bias. To name an example: Intending to measure bias in a news article, should we ask, “How biased is the article?” or should we instead ask, “How did the article treat the American president?” We conducted a literature search to find 824 relevant questions about text perception in previous research on the topic. In a multi-iterative process, we summarized and condensed these questions semantically to conclude a complete and representative set of possible question types about bias. The final set consisted of 25 questions with varying answering formats, 17 questions using semantic differentials, and six ratings of feelings. We tested each of the questions on 190 articles with 663 participants overall to identify how well the questions measure an article’s perceived bias. Our results show that 21 final items are suitable and reliable for measuring the perception of media bias. We publish the final set of questions on http://bias-question-tree.gipplab.org/.
Abstract
Media has a substantial impact on public perception of events, and, accordingly, the way media presents events can potentially alter the beliefs and views of the public. One of the ways in which bias in news articles can be introduced is by altering word choice. Such a form of bias is very challenging to identify automatically due to the high context-dependence and the lack of a large-scale gold-standard data set. In this paper, we present a prototypical yet robust and diverse data set for media bias research. It consists of 1,700 statements representing various media bias instances and contains labels for media bias identification on the word and sentence level. In contrast to existing research, our data incorporate background information on the participants’ demographics, political ideology, and their opinion about media in general. Based on our data, we also present a way to detect bias-inducing words in news articles automatically. Our approach is feature-oriented, which provides a strong descriptive and explanatory power compared to deep learning techniques. We identify and engineer various linguistic, lexical, and syntactic features that can potentially be media bias indicators. Our resource collection is the most complete within the media bias research area to the best of our knowledge. We evaluate all of our features in various combinations and retrieve their possible importance both for future research and for the task in general. We also evaluate various possible Machine Learning approaches with all of our features. XGBoost, a decision tree implementation, yields the best results. Our approach achieves an F₁-score of 0.43, a precision of 0.29, a recall of 0.77, and a ROC AUC of 0.79, which outperforms current media bias detection methods based on features. We propose future improvements, discuss the perspectives of the feature-based approach, and consider combining neural networks and deep learning with our current system.
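The feature-based classification step described here can be sketched with a handful of toy per-word features fed to XGBoost. The features and labels below are illustrative placeholders, not the full linguistic, lexical, and syntactic feature set from the paper.

```python
# Sketch: feature-based detection of bias-inducing words with XGBoost.
import numpy as np
from xgboost import XGBClassifier

# One row per word: [is_adjective, in_sentiment_lexicon, word_length, tf_idf]
X = np.array([
    [1, 1,  7, 0.42],   # "radical"
    [0, 0,  4, 0.10],   # "said"
    [1, 1,  7, 0.35],   # "extreme"
    [0, 0,  6, 0.08],   # "report"
])
y = np.array([1, 0, 1, 0])                # 1 = annotated as bias-inducing

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)
print(clf.predict_proba(X)[:, 1])         # per-word bias probability
```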
Abstract
Many people consider news articles to be a reliable source of information on current events. However, due to the range of factors influencing news agencies, such coverage may not always be impartial. Media bias, or slanted news coverage, can have a substantial impact on public perception of events, and, accordingly, can potentially alter the beliefs and views of the public. The main data gap in current research on media bias detection is a robust, representative, and diverse dataset containing annotations of biased words and sentences. In particular, existing datasets do not control for the individual background of annotators, which may affect their assessment and, thus, represents critical information for contextualizing their annotations. In this poster, we present a matrix-based methodology to crowdsource such data using a self-developed annotation platform. We also present MBIC (Media Bias Including Characteristics) - the first sample of 1,700 statements representing various media bias instances. The statements were reviewed by ten annotators each and contain labels for media bias identification both on the word and sentence level. MBIC is the first available dataset about media bias reporting detailed information on annotator characteristics and their individual background. The current dataset already significantly extends existing data in this domain providing unique and more reliable insights into the perception of bias. In future, we will further extend it both with respect to the number of articles and annotators per article.
Abstract
Unsupervised concept identification through clustering, i.e., the identification of semantically related words and phrases, is a common approach to identify contextual primitives employed in various use cases, e.g., text dimension reduction (replacing words with their concepts to reduce the vocabulary size), summarization, and named entity resolution. We demonstrate the first results of an unsupervised approach for the identification of groups of persons as actors extracted from a set of related articles. Specifically, the approach clusters mentions of groups of persons that act as non-named entity actors in the texts, e.g., “migrant families” = “asylum-seekers.” Compared to our baseline, the approach keeps the mentions of the geopolitical entities separated, e.g., “Iran leaders” ≠ “European leaders,” and clusters (in)directly related mentions with diverse wording, e.g., “American officials” = “Trump Administration.”
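A rough sketch of this kind of mention clustering: embed the actor mentions and group them by embedding similarity. The sentence-transformers model, the clustering threshold, and the toy mentions are assumptions for illustration; the paper's own approach is not reproduced here.

```python
# Sketch: cluster mentions of person groups by embedding similarity.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

mentions = ["migrant families", "asylum-seekers", "Iran leaders",
            "European leaders", "American officials", "Trump administration"]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(mentions, normalize_embeddings=True)

# Threshold chosen for illustration; it would need tuning on real data.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.1,
                                     linkage="average")
labels = clustering.fit_predict(embeddings)
for mention, label in zip(mentions, labels):
    print(label, mention)
```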
Abstract
Media bias describes differences in the content or presentation of news. It is a ubiquitous phenomenon in news coverage that can have severely negative effects on individuals and society. Identifying media bias is a challenging problem, for which current information systems offer little support. News aggregators are the most important class of systems to support users in coping with the large amount of news that is published nowadays. These systems focus on identifying and presenting important, common information in news articles, but do not reveal different perspectives on the same topic. Due to this analysis approach, current news aggregators cannot effectively reveal media bias. To address this problem, we present matrix-based news aggregation, a novel approach for news exploration that helps users gain a broad and diverse news understanding by presenting various perspectives on the same news topic. Additionally, we present NewsBird, an open-source news aggregator that implements matrix-based news aggregation for international news topics. The results of a user study showed that NewsBird more effectively broadens the user’s news understanding than the list-based visualization approach employed by established news aggregators, while achieving comparable effectiveness and efficiency for the two main use cases of news consumption: getting an overview of and finding details on current news topics.
Abstract
News is a central source of information for individuals to inform themselves on current topics. Knowing a news article’s slant and authenticity is of crucial importance in times of “fake news,” news bots, and centralization of media ownership. We introduce Newsalyze, a bias-aware news reader focusing on a subtle, yet powerful form of media bias, named bias by word choice and labeling (WCL). WCL bias can alter the assessment of entities reported in the news, e.g., “freedom fighters” vs. “terrorists.” At the core of the analysis is a neural model that uses a news-adapted BERT language model to determine target-dependent sentiment, a high-level effect of WCL bias. While the analysis currently focuses on only this form of bias, the visualizations already reveal patterns of bias when contrasting articles (overview) and in-text instances of bias (article view).
Abstract
Automated weighting is performed that includes transforming a behavior of each respective dimension of multiple dimensions of a selected group of events to a respective weight, the respective weight determined based on a distribution of values of the respective dimension, and where the weight determined for a first of the plurality of dimensions is greater than the weight determined for a second of the plurality of dimensions. Similarity values are computed indicating similarities between further events and the selected group of events, the similarity values based on a combination of the weights and distances between the further events and the selected group of events. Cohorts of the further events are generated by performing multi-level ranking that comprises ranking groups of the further events based on the similarity values, and applying merging to the groups to produce merged groups. The cohorts are visualized in a graphical visualization.
Abstract
Traditional media outlets are known to report political news in a biased way, potentially affecting the political beliefs of the audience and even altering their voting behaviors. Many researchers focus on automatically detecting and identifying media bias in the news, but only very few studies exist that systematically analyze how these biases can best be visualized and communicated. We create three manually annotated datasets and test varying visualization strategies. The results show no strong effects on bias awareness in the treatment groups compared to the control group, although a visualization of hand-annotated bias communicated bias instances more effectively than a framing visualization. Showing participants an overview page, which opposes different viewpoints on the same topic, does not yield differences in respondents’ bias perception. Using a multilevel model, we find that perceived journalist bias is significantly related to perceived political extremeness and impartiality of the article.
Abstract
Media bias may often affect individuals’ opinions on reported topics. Many existing methods that aim to identify such bias forms employ individual, specialized techniques and focus only on English texts. We propose to combine the state-of-the-art in order to further improve the performance in bias identification. Our prototype consists of three analysis components to identify media bias words in German news articles. We use an IDF-based component, a component utilizing a topic-dependent bias dictionary created using word embeddings, and an extensive dictionary of German emotional terms compiled from multiple sources. Finally, we discuss two not yet implemented analysis components that use machine learning and network analysis to identify media bias. All dictionary-based analysis components are experimentally extended with the use of general word embeddings. We also show the results of a user study.
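The IDF-based component mentioned in this abstract can be sketched as follows: compute inverse document frequencies over a reference corpus and flag an article's rarest words as candidate bias words. The corpus, article, and threshold below are toy placeholders, not the German resources used in the prototype.

```python
# Sketch of an IDF-based component: flag rare words as candidate bias words.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the chancellor presented the budget",
          "parliament debated the radical proposal",
          "the minister answered questions calmly"]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

article = "the radical proposal shocked parliament"
candidates = [w for w in article.split() if idf.get(w, 0) > 1.5]
print(candidates)   # words rare in the reference corpus
```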
Abstract
Slanted news coverage, also called media bias, can heavily influence how news consumers interpret and react to the news. Models to identify and describe biases have been proposed across various scientific fields, focusing mostly on English media. In this paper, we propose a method for analyzing media bias in German media. We test different natural language processing techniques and combinations thereof. Specifically, we combine an IDF-based component, a specially created bias lexicon, and a linguistic lexicon. We also flexibly extend our lexica using word embeddings. We evaluate the system and methods in a survey (N=46), comparing the bias words our system detected to human annotations. So far, the best component combination results in an F1 score of 0.31 for words identified as biased by both our system and our study participants. The low performance shows that the analysis of media bias is still a difficult task, but using fewer resources, we achieved the same performance on the same task as recent research on English. We summarize the next steps in improving the resources and the overall results.
Abstract
Dataset exploration is a set of techniques crucial in many research and data science projects. For textual datasets, commonly used techniques include topic modeling, document summarization, and methods related to dimension reduction. Despite their robustness, these techniques suffer from at least one of the following drawbacks: document summarization does not explicitly set documents in relation; the others yield summaries or topics that are often difficult to interpret and that yield poor results for topics consisting of context-dependent terms. We propose a method for dataset exploration that employs cross-document near-identity resolution of mentions of semantic concepts, such as persons, other named entities, events, and actions. The method not only sets documents in relation and thus allows for comparative dataset exploration, but also yields well-interpretable document representations. Additionally, due to the underlying approach for cross-document resolution of concept mentions, the method is able to set documents in relation as to their near-identity terms, e.g., synonyms that are not universally valid but only in the given dataset.
Abstract
Topic modeling is a technique used in a broad spectrum of use cases, such as data exploration, summarization, and classification. Despite being a crucial constituent of many use cases, established topic models, such as LDA, often produce statistically valid yet non-meaningful topics, i.e., topics that cannot easily be interpreted by humans. In turn, the usability of topic modeling approaches, e.g., in document summarization, is suboptimal. We propose a topic modeling approach that uses TCA, a method for cross-document coreference resolution that also resolves near-identity mentions. TCA showed promising results when resolving mentions of not only persons and other named entities, but also broad, vague, or abstract concepts. In a preliminary evaluation on news articles, we compare the approach with state-of-the-art topic modeling. We find that (1) the four baselines produce statistically valid yet hollow topics or topics that only refer to events in the dataset but not the events’ topical composition. (2) TCA is the only approach that extracts topics that distinctively describe meaningful parts of the dataset.
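For context, the following is a minimal sketch of an LDA baseline of the kind compared against here, using gensim on toy documents; TCA itself is not reproduced.

```python
# Sketch: LDA topic-modeling baseline on toy token lists.
from gensim import corpora
from gensim.models import LdaModel

docs = [["coalition", "vote", "parliament"],
        ["budget", "vote", "tax"],
        ["storm", "coast", "evacuation"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)   # top words per topic, often hard to interpret
```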
Abstract
Event extraction from news articles is a commonly required prerequisite for various tasks, such as article summarization, article clustering, and news aggregation. Due to the lack of universally applicable and publicly available methods tailored to news datasets, many researchers redundantly implement event extraction methods for their own projects. The journalistic 5W1H questions are capable of describing the main event of an article, i.e., by answering who did what, when, where, why, and how. We provide an in-depth description of an improved version of Giveme5W1H, a system that uses syntactic and domain-specific rules to automatically extract the relevant phrases from English news articles to provide answers to these 5W1H questions. Given the answers to these questions, the system determines an article’s main event. In an expert evaluation with three assessors and 120 articles, we determined an overall precision of p=0.73, and p=0.82 for answering the first four W questions, which alone can sufficiently summarize the main event reported on in a news article. We recently made our system publicly available, and it remains the only universal open-source 5W1H extractor capable of being applied to a wide range of use cases in news analysis.
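To give a feel for the syntactic and NER rules such an extractor applies, here is a toy sketch built on spaCy. This is not Giveme5W1H itself, only an illustration of the rule style; the sentence and rules are assumptions for demonstration.

```python
# Illustrative rule-based 5W sketch with spaCy (not Giveme5W1H).
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
doc = nlp("The city council approved the new budget on Monday in Berlin.")

sent = next(doc.sents)
who   = [t.text for t in sent if t.dep_ == "nsubj"]          # grammatical subject
what  = [t.text for t in sent if t.dep_ == "ROOT"]           # main verb
when  = [e.text for e in doc.ents if e.label_ == "DATE"]     # temporal entities
where = [e.text for e in doc.ents if e.label_ == "GPE"]      # locations

print("who:", who, "what:", what, "when:", when, "where:", where)
```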
Abstract
Media bias, i.e., slanted news coverage, can strongly impact the public perception of topics reported in the news. While the analysis of media bias has recently gained attention in computer science, the automated methods and results tend to be simple when compared to approaches and results in the social sciences, where researchers have studied media bias for decades. We propose Newsalyze, a work-in-progress prototype that imitates a manual analysis concept for media bias established in the social sciences. Newsalyze aims to find instances of bias by word choice and labeling in a set of news articles reporting on the same event. Bias by word choice and labeling (WCL) occurs when journalists use different phrases to refer to the same semantic concept, e.g., actors or actions. This way, instances of bias by WCL can induce strongly divergent emotional responses from readers, such as the terms “illegal aliens” vs. “undocumented immigrants.” We describe two critical tasks of the analysis workflow, finding and mapping such phrases, and estimating their effects on readers. For both tasks, we also present first results, which indicate the effectiveness of exploiting methods and models from the social sciences in an automated approach.
Abstract
Media bias can strongly impact the individual and public perception of news events. One difficult-to-detect, yet powerful form of slanted news coverage is bias by word choice and labeling (WCL). Bias by WCL can occur when journalists refer to the same concept, yet use different terms, which results in different sentiments being sparked in the readers, such as the terms “economic migrants” vs. “refugees.” We present an automated approach to identify bias by WCL that employs models and manual analysis approaches from the social sciences, a research domain in which media bias has been studied for decades. This paper makes three contributions. First, we present NewsWCL50, the first open evaluation dataset for the identification of bias by WCL consisting of 8,656 manual annotations in 50 news articles. Second, we propose a method capable of extracting instances of bias by WCL while outperforming state-of-the-art methods, such as coreference resolution, which currently cannot resolve very broadly defined or abstract coreferences used by journalists. We evaluate our method on the NewsWCL50 dataset, achieving an F1=45.7% compared to F1=29.8% achieved by the best performing state-of-the-art technique. Lastly, we present a prototype demonstrating the effectiveness of our approach in finding frames caused by bias by WCL.
Abstract
The identification and extraction of the events that news articles report on is a commonly performed task in the analysis workflow of various projects that analyze news articles. However, due to the lack of universally usable and publicly available methods for news articles, many researchers must redundantly implement methods for event extraction to be used within their projects. Answers to the journalistic five W and one H questions (5W1H) describe the main event of a news story, i.e., who did what, when, where, why, and how. We propose Giveme5W1H, an open-source system that uses syntactic and domain-specific rules to extract phrases answering the 5W1H. In our evaluation, we find that the extraction precision of 5W1H phrases is p=0.64, and p=0.79 for the first four W questions, which discretely describe an event.
Abstract
Media bias, i.e., slanted news coverage, can strongly impact the public perception of the reported topics. In the social sciences, research over the past decades has developed comprehensive models to describe media bias and effective, yet often manual and thus cumbersome, methods for analysis. In contrast, in computer science fast, automated, and scalable methods are available, but few approaches systematically analyze media bias. The models used to analyze media bias in computer science tend to be simpler compared to models established in the social sciences, and do not necessarily address the most pressing substantial questions, despite technically superior approaches. Computer science research on media bias thus stands to profit from a closer integration of models for the study of media bias developed in the social sciences with automated methods from computer science. This article first establishes a shared conceptual understanding by mapping the state of the art from the social sciences to a framework, which can be targeted by approaches from computer science. Next, we investigate different forms of media bias and review how each form is analyzed in the social sciences. For each form, we then discuss methods from computer science suitable to (semi-)automate the corresponding analysis. Our review suggests that suitable, automated methods from computer science, primarily in the realm of natural language processing, are already available for each of the discussed forms of media bias, opening multiple directions for promising further research in computer science in this area.
Abstract
Extraction of event descriptors from news articles is commonly required for various tasks, such as clustering related articles, summarization, and news aggregation. Due to the lack of generally usable and publicly available methods optimized for news, many researchers must redundantly implement such methods for their project. Answers to the five journalistic W questions (5Ws) describe the main event of a news article, i.e., who did what, when, where, and why. The main contribution of this paper is Giveme5W, the first open-source, syntax-based 5W extraction system for news articles. The system retrieves an article’s main event by extracting phrases that answer the journalistic 5Ws. In an evaluation with three assessors and 60 articles, we find that the extraction precision of 5W phrases is p = 0.7.
Abstract
We present an open-source math-aware Question Answering System based on Ask Platypus. Our system returns a single mathematical formula as the answer to a natural language question in English or Hindi. These formulae originate from the knowledge-base Wikidata. We translate these formulae to computable data by integrating the calculation engine sympy into our system. This way, users can enter numeric values for the variables occurring in the formula. Moreover, the system loads numeric values for constants occurring in the formula from Wikidata. In a user study, our system outperformed a commercial computational mathematical knowledge engine by 13%. However, the performance of our system heavily depends on the size and quality of the formula data available in Wikidata. Since only a few items in Wikidata contained formulae when we started the project, we facilitated the import process by suggesting formula edits to Wikidata editors. With the simple heuristic that the first formula is significant for the article, 80% of the suggestions were correct.
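The calculation step described above, parsing a formula and substituting numeric values with sympy, looks roughly like the sketch below; the formula string is an example, not one retrieved from Wikidata.

```python
# Sketch: parse a formula with sympy and substitute numeric values.
from sympy import symbols, sympify

formula = sympify("1/2 * m * v**2")   # kinetic energy, as an example
m, v = symbols("m v")

result = formula.subs({m: 2.0, v: 3.0})
print(result)                          # 9.0
```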
Abstract
Depending on the news source, a reader can be exposed to a different narrative and conflicting perceptions for the same event. Today, news aggregators help users cope with the large volume of news published daily. However, aggregators focus on presenting shared information, but do not expose the different perspectives from articles on the same topics. Thus, users of such aggregators suffer from media bias, which is often introduced intentionally to influence public opinion. In this paper, we present NewsBird, an aggregator that presents shared and different information on topics. Currently, NewsBird reveals different perspectives on international news. Our system has led to insights about media bias and news analysis, which we use to propose approaches to be investigated in future research. Our vision is to provide a system that reveals media bias, and thus ultimately allows users to make their own judgement on the potential bias inherent in news.
Abstract
The amount of news published and read online has increased tremendously in recent years, making news data an interesting resource for many research disciplines, such as the social sciences and linguistics. However, large-scale collection of news data is cumbersome due to a lack of generic tools for crawling and extracting such data. We present news-please, a generic, multi-language, open-source crawler and extractor for news that works out-of-the-box for a large variety of news websites. Our system allows crawling arbitrary news websites and extracting the major elements of news articles on those websites, i.e., title, lead paragraph, main content, publication date, author, and main image. Compared to existing tools, news-please features full website extraction requiring only the root URL.
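Single-article extraction with news-please looks roughly like the sketch below. The URL is a placeholder to be replaced with a real article URL, and the attribute names follow the project's README at the time of writing; check the repository for the current API.

```python
# Sketch: extract one article with news-please.
from newsplease import NewsPlease

article = NewsPlease.from_url("https://example.com/some-news-article")
print(article.title)
print(article.date_publish)
print((article.maintext or "")[:300])   # main content may be empty for non-news pages
```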
Abstract
News aggregators capably handle the large amount of news that is published nowadays. However, these systems focus on identifying and presenting important, common information in news, but do not reveal different perspectives on the same topic. Differences in the content or presentation of news are referred to as media bias, which can have severe negative effects. Given their analysis approach, current news aggregators cannot effectively reveal media bias. To address this problem, we present matrix-based news analysis (MNA), a novel approach for news exploration that helps users gain a broad and diverse news understanding by presenting various perspectives on the same news topic. Additionally, we present NewsBird, a news aggregator that implements MNA for international news topics. The results of a case study demonstrate that NewsBird broadens the user’s news understanding while providing similar news aggregation functionalities as established systems.
Abstract
A representation of behavior of a selected group of events is obtained from a library, the representation including values corresponding to respective distributions of dimensions of the events. Similarity values indicating similarities between further events and the selected group of events are computed, the similarity values based on a combination of the representation of behavior and distances between the further events and the selected group of events. Further groups of the further events are computed based on the similarity values and behavior matching using the representation of behavior of the selected group of events. The further groups are visualized in a visualization.
Abstract
A method and/or computer program product recovers files that are generated by an application running on a client-server system that comprises a back-up client with a client back-up tool and a server with a server back-up tool. Application files are backed up on the server, and then restored to a back-up client based on file usage behavior of the application and their priority, and file stubs are created for remaining files. File usage behavior of the application performing data recovery and regular data processing after said restore process are monitored and analyzed, and files in different types of priority classes are classified based on file usage behavior. Existing file stubs at the back-up client are replaced with corresponding file content from the back-up server during runtime of the application based on predetermined criteria.
Abstract
Techniques are disclosed for determining reasons underlying insights gleaned from multi-dimensional data. In one embodiment, a contingency table is accessed that represents multiple dimensions of the data, in order to identify one or more insights. One or more dimensions, other than the represented dimensions, are evaluated to identify one or more reasons underlying a first insight of the one or more insights, and the one or more reasons are output.