Publications by Prof. Dr. René Peinl

VLM@school – Evaluation of AI image understanding on German middle school knowledge

Peinl, René; Tischler, Vincent (2025)

Future Technologies Conference (FTC) 2025, Munich, Germany 2025.

Open Access Peer Reviewed

ABSTRACT

This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarially crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.

Practicing of home nursing activities in a VR simulation

Peinl, René; Eren, Özgür (2025)

11th International Conference of the Immersive Learning Research Network (iLRN2025), June 16-19, 2025, Chicago, IL, United States 2025.

Peer Reviewed

ABSTRACT

Virtual reality has proven to be a valuable addition to the tool belt of teachers. Immersive learning environments are applied in various settings, including, but not limited to, the medical and nursing domain. In this study we present “We care in VR”, a simulation for practicing nursing tasks for care at home, a part of nursing that is currently underrepresented in available VR applications. We investigate how realistic interactions are perceived by end users compared to consistent usage of buttons on the controllers and how they affect the ease of use of the simulation. We conduct an empirical study with 50 participants from three vocational schools of nursing and a university of applied sciences. Results suggest that our simulation already works quite well and is accepted by the target group, but still needs improvement regarding ease of use, especially for users without any previous experience with VR applications.

Using LLMs as prompt modifier to avoid biases in AI image generators

Peinl, René (2025)

9th International Conference on Advances in Artificial Intelligence (ICAAI 2025), September 11-13, 2025 in Manchester, UK 2025.

Open Access Peer Reviewed

ABSTRACT

This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model's unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at https://iisys-hof.github.io/llm-prompt-img-gen/

To Model, to Prompt, or to Code? The Choice is Yours — A Multi-paradigmatic Approach to Software Development

Buchmann, Thomas; Schwägerl, Felix; Peinl, René (2025)

20th International Conference on Software Technologies. 10-12.06.2025, Bilbao, Spain.

Open Access Peer Reviewed

ABSTRACT

This paper considers three fundamental approaches to software development, namely manual coding, model-driven software engineering, and code generation by large language models. All of these approaches have their individual pros and cons, motivating the desire for an integrated approach. We present MoProCo, a technical solution to integrate the three approaches into a single tool chain, allowing the developer to split a software engineering task into modeling, prompting or coding sub-tasks. From a single input file consisting of static model structure, natural language prompts and/or source code fragments, Java source code is generated using a two-stage approach. A case study demonstrates that the MoProCo approach combines the desirable properties of the three development approaches by offering the appropriate level of abstraction, determinism, and dynamism for each specific software engineering sub-task.

Benchmarking Vision Language Models on German Factual Data

Peinl, René; Tischler, Vincent (2025)

21st International Conference on Artificial Intelligence Applications and Innovations, 26 – 29 June, 2025, Limassol, Cyprus.

Open Access Peer Reviewed

ABSTRACT

Similar to LLMs, the development of vision language models is mainly driven by English datasets and models trained in English and Chinese language, whereas support for other languages, even those considered high-resource languages such as German, remains significantly weaker. In this work we present an analysis of open-weight VLMs on factual knowledge in the German and English language. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they are lacking visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents according to the scientific name or English common name but fail in German language. Cars and supermarket products were identified equally well in English and German images across both prompt languages.

Komprimierte KI - Wie Quantisierung große Sprachmodelle verkleinert

Peinl, René (2025)

c't - Magazin für Computertechnik 2025 (2), pp. 120-125.

ABSTRACT

Large language models such as ChatGPT require large, expensive servers and a lot of energy. However, they can be quantized so that they get by with far less memory and power and can even run locally on a smartphone. We explain why quantized models respond much faster and are nevertheless almost as smart as the large originals.

Using LLMs to Improve Reproducibility of Literature Reviews.

Peinl, René; Haberl, Armin; Baernthaler, Jonathan; Chouguley, Sarang...

SIGSDA Symposium at the International Conference on Information Systems 2024. Bangkok, Thailand.

Open Access Peer Reviewed

ABSTRACT

Literature reviews play a crucial role in Information Systems (IS) research. However, scholars have expressed concerns regarding the reproducibility of their results and the quality of documentation. The involvement of human reproducers in these reviews is often hindered by the time-consuming nature of the procedures. The emergence of Large Language Models (LLMs) seems promising to support researchers and to enhance reproducibility. To explore this potential, we conducted experiments using various LLMs, focusing on abstract scanning, and have presented initial evidence suggesting that the application of LLMs in structured literature reviews could assist researchers in refining and formulating rules for abstract scanning. Based on our preliminary findings, we identify potential future research directions in this research in progress paper.

Comparing human-labeled and AI-labeled speech datasets for TTS

Wirth, Johannes; Peinl, René (2024)

4th European Conference on the Impact of Artificial Intelligence and Robotics (ICAIR 2024) 2024.

Open Access Peer Reviewed

ABSTRACT

As the output quality of neural networks in the fields of automatic speech recognition (ASR) and text-to-speech (TTS) continues to improve, new opportunities are becoming available to train models in a weakly supervised fashion, thus minimizing the manual effort required to annotate new audio data for supervised training. While weak supervision has recently shown very promising results in the domain of ASR, speech synthesis has not yet been thoroughly investigated regarding this technique despite requiring the equivalent training dataset structure of aligned audio-transcript pairs.
In this work, we compare the performance of TTS models trained using a well-curated and manually labeled training dataset to others trained on the same audio data with text labels generated using both grapheme- and phoneme-based ASR models. Phoneme-based approaches seem especially promising, since even for wrongly predicted phonemes, the resulting word is more likely to sound similar to the originally spoken word than for grapheme-based predictions.
For evaluation and ranking, we generate synthesized audio outputs from all previously trained models using input texts sourced from a selection of speech recognition datasets covering a wide range of application domains. These synthesized outputs are subsequently fed into multiple state-of-the-art ASR models with their output text predictions being compared to the initial TTS model input texts. This comparison enables an objective assessment of the intelligibility of the audio outputs from all TTS models, by utilizing metrics like word error rate and character error rate.
Our results not only show that models trained on data generated with weak supervision achieve comparable quality to models trained on manually labeled datasets, but can outperform the latter, even for small, well-curated speech datasets. These findings suggest that the future creation of labeled datasets for supervised training of TTS models may not require any manual annotation but can be fully automated.
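The intelligibility metric named in the abstract, word error rate, can be illustrated with a minimal sketch. It assumes WER is the word-level edit distance between the TTS input text and the ASR transcript, divided by the number of words in the input text. The function name `word_error_rate` and the example sentences are illustrative and not taken from the paper, which may use a different implementation or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: TTS input text versus the ASR transcript of its audio.
tts_input = "the cat sat on the mat"
asr_output = "the cat sat on a mat"
print(word_error_rate(tts_input, asr_output))
```

Character error rate follows the same scheme with characters in place of words.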

Ethical Generative AI – What Kind of AI Results are Desired by Society?

Peinl, René; Wagener, Andreas; Lehmann, Marc (2024)

4th European Conference on the Impact of Artificial Intelligence and Robotics (ICAIR 2024), Lisbon, Portugal 2024.

Open Access Peer Reviewed

ABSTRACT

There are many publications talking about the biases to be found in generative AI solutions like large language models (LLMs, e.g., Mistral) or text-to-image models (T2IMs, e.g., Stable Diffusion). However, there is hardly any publication to be found that questions what kind of behavior is actually desired, not only by a couple of researchers, but by society in general. Most researchers in this area seem to think that there would be a common agreement, but political debate in other areas shows that this is seldom the case, even for a single country. Climate change, for example, is an empirically well-proven scientific fact, and 197 countries (including Germany) have declared in the Paris Agreement that they will do their best to limit global warming to a maximum of 1.5°C, yet renowned German scientists still call LLMs biased if they state that there is human-made climate change and that humanity is not doing enough to stop it. This trend is especially visible in Western individualistic societies that favor personal well-being over common good. In this article, we explore different aspects of biases found in LLMs and T2IMs, highlight potential divergence in the perception of ethically desirable outputs, and discuss potential solutions with their advantages and drawbacks from the perspective of society. The analysis is carried out in an interdisciplinary manner with the authors coming from backgrounds as diverse as business information systems, political sciences, and law. Our contribution brings new insights to this debate and sheds light on an important aspect of the discussion that has been largely ignored up to now.

Die innere Stimme - Wenn der Chatbot den Roboter steuert.

Peinl, René (2024)

c't Magazin für Computertechnik 2024 (23), pp. 130-132.

ABSTRACT

Robots that work autonomously and flexibly could help around the house in the future. To plan their steps, they need artificial intelligence. Generative language models are meant not only to write sentences or program code for this purpose, but also to structure the workflows.

White-box LLM-supported Low-code Engineering: A Vision and First Insights

Buchmann, Thomas; Peinl, René; Schwägerl, Felix (2024)

27th International Conference on Model Driven Engineering Languages and Systems (Models 2024) 2024, pp. 556–560.
DOI: 10.1145/3652620.3687803

Open Access Peer Reviewed

ABSTRACT

Low-code development (LCD) platforms promise to empower citizen developers to define core domain models and rules for business applications. However, as domain rules grow complex, LCD platforms may fail to do so effectively. Generative AI, driven by large language models (LLMs), offers source code generation from natural language but suffers from its non-deterministic black-box nature and limited explainability. Therefore, rather than having LLMs generate entire applications from single prompts, we advocate for a white-box approach allowing citizen developers to specify domain models semi-formally, attaching constraints and operations as natural language annotations. These annotations are fed incrementally into an LLM contextualized with the generated application stub. This results in deterministic and better explainable generation of static application components, while offering citizen developers an appropriate level of abstraction. We report on a case study in manufacturing execution systems, where the implementation of the approach provides first insights.

Mit allen Sinnen - Multimodale KIs kombinieren Bild und Text.

Peinl, René (2024)

c't Magazin für Computertechnik 2024 (11), pp. 52-56.

ABSTRACT

Hardly have people become accustomed to text and image generators when OpenAI, Google, Microsoft, and Meta release their multimodal models, which unite both worlds. This gives practical AI applications, and even robots, a more comprehensive understanding of the world.

Evaluation of Medium-Sized Language Models in German and English Language

Peinl, René; Wirth, Johannes (2024)

International Journal on Natural Language Computing (IJNLC) 2024 (1).

Open Access

ABSTRACT

Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering in German and English language, which requires models to provide elaborate answers without external document retrieval (RAG). The paper introduces its own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7%, which is better than the 60.9% of ChatGPT. The best English MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. The best German model also surpasses ChatGPT for the equivalent dataset. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.

Wenn Prozessmodellierung Realität wird - Modernes Geschäftsprozessmanagement

Peinl, René (2023)

IM+io - Best Practices aus Digitalisierung | Management | Wissenschaft 2023 (4).

ABSTRACT

In industry, a lot of time is still wasted modeling business processes as diagrams in order to get an overview, analyze them, and improve them. For software support as executable processes in an enterprise information system such as ERP or MES, however, they have to be implemented again in a system-specific way. In the production environment, the open-source system HiCuMES, a mix of low-code tool with graphical editors and manufacturing execution system, offers a different way.

SPORENLP: A Spatial Recommender System for Scientific Literature

Wirth, Johannes; Roßner, Daniel; Peinl, René; Atzenbeck, Claus (2023)

Proceedings of the 19th International Conference on Web Information Systems and Technology (WEBIST'23) 2023, pp. 429–436.
DOI: 10.5220/0012210400003584

Open Access Peer Reviewed

ABSTRACT

SPORENLP is a recommendation system designed to review scientific literature. It operates on a sub-dataset comprising 15,359 publications, with a total of 117,941,761 pairwise comparisons. This dataset includes both metadata comparisons and text-based similarity aspects obtained using natural language processing (NLP) techniques. Unlike other recommendation systems, SPORENLP does not rely on specific aspect features. Instead, it identifies the top k candidates based on shared keywords and embedding-related similarities between publications, enabling content-based, intuitive, and adjustable recommendations without excluding possible candidates through classification. To provide users with an intuitive interface for interacting with the dataset, we developed a web-based front-end that takes advantage of the principles of spatial hypertext. A qualitative expert evaluation was conducted on the dataset. The dataset creation pipeline and the source code for SPORENLP will be made freely available to the research community, allowing further exploration and improvement of the system.
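The two dataset figures in the abstract are consistent with each other. A quick check, assuming every publication is compared with every other publication exactly once (the abstract does not spell out how the pairs are formed, so this is an assumption; the variable names are illustrative):

```python
from math import comb

publications = 15_359

# Number of unordered pairs among all publications: n * (n - 1) / 2.
pairwise_comparisons = comb(publications, 2)
print(pairwise_comparisons)  # 117941761
```

The result matches the 117,941,761 pairwise comparisons reported for the sub-dataset.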

Klein aber fein - Wie kompakte Sprachmodelle die Giganten herausfordern

Peinl, René (2023)

c't - Magazin für Computertechnik 2023 (26), pp. 50-55.

ABSTRACT

For a while, the parameter count of large language models knew only one direction: steeply upward. More parameters mean more and higher-quality capabilities, so the belief went. But 2023 was the hour of the medium-sized language AIs: they are frugal and surprisingly competitive. In some disciplines they come remarkably close to GPT-4 with its rumored 1.8 trillion parameters. This opens up huge potential, including for small and medium-sized companies that are toying with applications of their own. We explain what the lean relatives of the giants can do, what makes them so efficient, and what the future of the language model landscape could look like.

Evaluation of medium-large Language Models at zero-shot closed book generative question answering

Peinl, René; Wirth, Johannes (2023)

11th International Conference on Artificial Intelligence and Applications (AIAP) 2023.

Open Access Peer Reviewed

ABSTRACT

Large language models (LLMs) have garnered significant attention, but the definition of "large" lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces its own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7%, which is better than the 60.9% of ChatGPT. The best MLM achieved 46.4% and has 7B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers.

The Hochschul-Assistenz-System HAnS: An ML-Based Learning Experience Platform

Ranzenberger, Thomas; Bocklet, Tobias; Freisinger, Steffen; Frischholz, Lia...

Elektronische Sprachsignalverarbeitung 2023 2023.

Open Access Peer Reviewed

ABSTRACT

The usage of e-learning platforms, online lectures and online meetings for academic teaching increased during the Covid-19 pandemic. Lecturers created video lectures, screencasts, or audio podcasts for online learning. The Hochschul-Assistenz-System (HAnS) is a learning experience platform that uses machine learning (ML) methods to support students and lecturers in the online learning and teaching processes. HAnS is being developed in multiple iterations as an agile open-source collaborative project supported by multiple universities and partners. This paper presents the current state of the development of HAnS on German video lectures.

Dependencies between MES features and efficient introduction

Peinl, René; Purucker, Susanne K; Vogel, Sabine (2022)

14th International Conference on ENTERprise Information Systems (CENTERIS 2022).

Peer Reviewed

Automatic Speech Recognition in German - A Detailled Error Analysis

Wirth, Johannes; Peinl, René (2022)

IEEE Coins - International Conference on Omni Layer Intelligent Systems.

Open Access Peer Reviewed

ABSTRACT

The amount of freely available systems for automatic speech recognition (ASR) based on neural networks is growing steadily, with equally increasingly reliable predictions. However, the evaluation of trained models is typically exclusively based on statistical metrics such as WER or CER, which do not provide any insight into the nature or impact of the errors produced when predicting transcripts from speech input. This work presents a selection of ASR model architectures that are pretrained on the German language and evaluates them on a benchmark of diverse test datasets. It identifies cross-architectural prediction errors, classifies those into categories and traces the sources of errors per category back into training data as well as other sources. Finally, it discusses solutions in order to create qualitatively better training datasets and more robust ASR systems.

Prof. Dr. René Peinl

Hochschule für Angewandte Wissenschaften Hof

Forschungsgruppe Systemintegration (SI)
Alfons-Goppel-Platz 1
95028 Hof

T +49 9281 409-4820
rene.peinl[at]hof-university.de