Heindorf S, Potthast M, Stein B and Engels G (2015), "Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis", In 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 15), pp. 831-834. ACM.
Abstract: We report on the construction of the Wikidata Vandalism Corpus WDVC-2015, the first corpus for vandalism in knowledge bases. Our corpus is based on the entire revision history of Wikidata, the knowledge base underlying Wikipedia. Among Wikidata's 24 million manual revisions, we have identified more than 100,000 cases of vandalism. An in-depth corpus analysis lays the groundwork for research and development on automatic vandalism detection in public knowledge bases. Our analysis shows that 58% of the vandalism revisions can be found in the textual portions of Wikidata, and the remainder in structural content, e.g., subject-predicate-object triples. Moreover, we find that some vandals also target Wikidata content whose manipulation may impact content displayed on Wikipedia, revealing potential vulnerabilities. Given today's importance of knowledge bases for information systems, this shows that public knowledge bases must be used with caution.
BibTeX:
@inproceedings{heindorf2015,
  author = {Stefan Heindorf and Martin Potthast and Benno Stein and Gregor Engels},
  title = {Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis},
  booktitle = {38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 15)},
  publisher = {ACM},
  year = {2015},
  pages = {831-834},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Towards%20Vandalism%20Detection%20in%20Knowledge%20Bases.pdf},
  doi = {10.1145/2766462.2767804}
}
Wachsmuth H and Stein B (2012), "Optimal Scheduling of Information Extraction Algorithms", In Proceedings of the 24th International Conference on Computational Linguistics: Posters. Mumbai, India, pp. 1281-1290. The COLING 2012 Organizing Committee.
Abstract: Most research on run-time efficiency in information extraction is of empirical nature. This paper analyzes the efficiency of information extraction pipelines from a theoretical point of view in order to explain empirical findings. We argue that information extraction can, at its heart, be viewed as a relevance filtering task whose efficiency traces back to the run-times and selectivities of the employed algorithms. To better understand the intricate behavior of information extraction pipelines, we develop a sequence model for scheduling a pipeline's algorithms. In theory, the most efficient schedule corresponds to the Viterbi path through this model and can hence be found by dynamic programming. For real-time applications, it might be too expensive to compute all run-times and selectivities beforehand. However, our model implies the benchmarks of filtering tasks and illustrates that the optimal schedule depends on the distribution of relevant information in the input texts. We give formal and experimental evidence where necessary.
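The scheduling objective can be made concrete with a small sketch in Python (the run-times, selectivities, and algorithm names below are invented, not taken from the paper). It computes the expected per-unit cost of a schedule and finds the cheapest order by exhaustive search; for a handful of algorithms this yields the same result as the Viterbi-path computation over the paper's sequence model:

from itertools import permutations

# Invented run-times (ms per text unit) and selectivities (fraction of
# units that pass the algorithm's relevance filter); the names are
# hypothetical.
ALGORITHMS = {
    "time_tagger":  (2.0, 0.30),
    "money_tagger": (3.0, 0.20),
    "org_tagger":   (8.0, 0.50),
}

def expected_cost(schedule):
    """Expected run-time per input unit: each algorithm only processes
    the fraction of units that passed all earlier filters."""
    cost, passing = 0.0, 1.0
    for name in schedule:
        run_time, selectivity = ALGORITHMS[name]
        cost += passing * run_time
        passing *= selectivity
    return cost

best = min(permutations(ALGORITHMS), key=expected_cost)
print(best, round(expected_cost(best), 2))
# ('time_tagger', 'money_tagger', 'org_tagger') 3.38

Cheap, highly selective filters move to the front: every later algorithm only pays for the text units that passed all earlier ones.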
BibTeX:
@inproceedings{wachsmuth:2012,
  author = {Henning Wachsmuth and Benno Stein},
  title = {Optimal Scheduling of Information Extraction Algorithms},
  booktitle = {Proceedings of the 24th International Conference on Computational Linguistics: Posters},
  publisher = {The COLING 2012 Organizing Committee},
  year = {2012},
  pages = {1281-1290},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Optimal%20Scheduling%20of%20Information%20Extraction%20Algorithms.pdf}
}
Wachsmuth H, Rose M and Engels G (2013), "Automatic Pipeline Construction for Real-Time Annotation", In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics. Samos, Greece, March, 2013. Vol. 7816, pp. 38-49. Springer.
Abstract: Many annotation tasks in computational linguistics are tackled with manually constructed pipelines of algorithms. In real-time tasks where information needs are stated and addressed ad-hoc, however, manual construction is infeasible. This paper presents an artificial intelligence approach to automatically construct annotation pipelines for given information needs and quality prioritizations. Based on an abstract ontological model, we use partial order planning to select a pipeline's algorithms and informed search to obtain an efficient pipeline schedule. We realized the approach as an expert system on top of Apache UIMA, which offers evidence that pipelines can be constructed ad-hoc in near-zero time.
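For illustration, here is a minimal sketch of type-driven pipeline construction in Python. The algorithm registry and annotation type names are invented, and a plain topological sort stands in for the paper's combination of partial order planning and informed search:

# Each analysis algorithm declares the annotation types it requires and
# produces; all names here are invented.
ALGORITHMS = {
    "tokenizer":    (set(),            {"Token"}),
    "pos_tagger":   ({"Token"},        {"POS"}),
    "time_tagger":  ({"Token"},        {"Time"}),
    "money_tagger": ({"Token", "POS"}, {"Money"}),
}

def construct_pipeline(needed_types):
    # 1. Selection: backward-chain from the needed types to the
    #    algorithms that produce them, adding their inputs as new goals.
    selected, goals = set(), set(needed_types)
    while goals:
        goal = goals.pop()
        producer = next(name for name, (_, out) in ALGORITHMS.items()
                        if goal in out)
        if producer not in selected:
            selected.add(producer)
            goals |= ALGORITHMS[producer][0]
    # 2. Scheduling: repeatedly append any algorithm whose required
    #    types have already been produced (a plain topological sort;
    #    assumes the registry is acyclic and complete).
    pipeline, produced = [], set()
    while selected:
        ready = next(n for n in selected if ALGORITHMS[n][0] <= produced)
        selected.remove(ready)
        pipeline.append(ready)
        produced |= ALGORITHMS[ready][1]
    return pipeline

print(construct_pipeline({"Money", "Time"}))
# e.g. ['tokenizer', 'pos_tagger', 'time_tagger', 'money_tagger']

Given the need {"Money", "Time"}, the sketch selects only the producing algorithms plus their prerequisites and orders them so that every algorithm's required types are available when it runs.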
BibTeX:
@inproceedings{wachsmuth:2013a,
  author = {Henning Wachsmuth and Mirko Rose and Gregor Engels},
  editor = {Alexander Gelbukh},
  title = {Automatic Pipeline Construction for Real-Time Annotation},
  booktitle = {Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics},
  publisher = {Springer},
  year = {2013},
  volume = {7816},
  pages = {38-49},
  note = {ISBN: 978-3-642-37246-9},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Automatic%20Pipeline%20Construction%20for%20Real-Time%20Annotation.pdf},
  doi = {10.1007/978-3-642-37247-6_4}
}
Wachsmuth H, Stein B and Engels G (2013), "Learning Efficient Information Extraction on Heterogeneous Texts", In Proceedings of the 6th Internation Joint Conference on Natural Language Processing. Nagoya, Japan, October, 2013. , pp. 534-542. AFNLP.
Abstract: From an efficiency viewpoint, information extraction means to filter the relevant portions of natural language texts as fast as possible. Given an extraction task, different pipelines of algorithms can be devised that provide the same precision and recall but that vary in their run-time due to different pipeline schedules. While recent research has investigated how to determine the run-time optimal schedule for a collection or a stream of texts, this paper goes one step beyond: we analyze the run-time of efficient schedules as a function of the heterogeneity of texts and we show how this heterogeneity is characterized from a data perspective. For extraction tasks on heterogeneous big data, we present a self-supervised online adaptation approach that learns to predict the optimal schedule depending on the input text. Our evaluation suggests that the approach will significantly improve efficiency on collections and streams of texts of high heterogeneity.
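A minimal sketch of such self-supervised adaptation in Python (the digit-density feature and the epsilon-greedy selection are illustrative simplifications, not the paper's learned model; run-times observed in regular operation are the only training signal):

import random
from collections import defaultdict

class ScheduleSelector:
    """Picks the expectedly fastest schedule per input text; observed
    run-times serve as training signal, so no manual labels are needed."""

    def __init__(self, schedules, explore=0.1):
        self.schedules = schedules
        self.explore = explore
        self.stats = defaultdict(lambda: [0.0, 0])  # (feature, schedule) -> [total_ms, n]

    def feature(self, text):
        # Coarse heterogeneity signal: digit density as a proxy for how
        # much numeric (time/money) information a text contains.
        digits = sum(c.isdigit() for c in text)
        return "dense" if digits / max(len(text), 1) > 0.02 else "sparse"

    def pick(self, text):
        if random.random() < self.explore:
            return random.choice(self.schedules)   # keep exploring
        f = self.feature(text)
        def avg_ms(schedule):
            total, n = self.stats[f, schedule]
            return total / n if n else 0.0         # unseen looks fast: try it
        return min(self.schedules, key=avg_ms)

    def observe(self, text, schedule, run_time_ms):
        record = self.stats[self.feature(text), schedule]
        record[0] += run_time_ms
        record[1] += 1

selector = ScheduleSelector([("time", "money"), ("money", "time")])
text = "Revenue rose to 2.1 billion EUR in 2012."
schedule = selector.pick(text)
selector.observe(text, schedule, run_time_ms=3.4)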
BibTeX:
@inproceedings{wachsmuth:2013b,
  author = {Henning Wachsmuth and Benno Stein and Gregor Engels},
  title = {Learning Efficient Information Extraction on Heterogeneous Texts},
  booktitle = {Proceedings of the 6th International Joint Conference on Natural Language Processing},
  publisher = {AFNLP},
  year = {2013},
  pages = {534-542},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Learning%20Efficient%20Information%20Extraction%20on%20Heterogeneous%20Texts.pdf}
}
Wachsmuth H, Stein B and Engels G (2013), "Information Extraction as a Filtering Task", In Proceedings of the 22nd ACM Conference on Information and Knowledge Management. San Francisco, CA, USA , pp. 2049-2058. ACM.
Abstract: Information extraction is usually approached as an annotation task: Input texts run through several analysis steps of an extraction process in which different semantic concepts are annotated and matched against the slots of templates. We argue that such an approach lacks an efficient control of the input of the analysis steps. In this paper, we hence propose and evaluate a model and a formal approach that consistently put the filtering view in the focus: Before spending annotation effort, filter those portions of the input texts that may contain relevant information for filling a template and discard the others. We model all dependencies between the semantic concepts sought for with a truth maintenance system, which in turn infers the portions of text to be annotated in each analysis step. The filtering view enables an information extraction system (1) to annotate only relevant portions of input texts and (2) to easily trade its run-time efficiency for its recall. We provide our approach as an open-source extension of the Apache UIMA framework and we show the potential of our approach in a number of experiments.
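The filtering view can be sketched in a few lines of Python (the regex gate below is a toy stand-in for the paper's truth maintenance system over concept dependencies):

import re

# Cheap relevance gate: a text portion can only fill a revenue template
# if it mentions money, so everything else is discarded unannotated.
MONEY = re.compile(r"\d+(\.\d+)?\s?(million|billion|EUR|USD)", re.I)

def filter_then_annotate(sentences, expensive_annotator):
    for sentence in sentences:
        if MONEY.search(sentence):            # cheap filter first
            yield expensive_annotator(sentence)

annotated = list(filter_then_annotate(
    ["Revenue grew to 2.1 billion EUR in 2012.", "The CEO resigned."],
    lambda s: (s, "revenue-candidate")))      # placeholder annotator
print(annotated)  # only the first sentence was annotated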
BibTeX:
@inproceedings{wachsmuth:2013c,
  author = {Henning Wachsmuth and Benno Stein and Gregor Engels},
  title = {Information Extraction as a Filtering Task},
  booktitle = {Proceedings of the 22nd ACM Conference on Information and Knowledge Management},
  publisher = {ACM},
  year = {2013},
  pages = {2049-2058},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Information%20Extraction%20as%20a%20Filtering%20Task.pdf},
  doi = {10.1145/2505515.2505557}
}
Wachsmuth H, Trenkmann M, Stein B and Engels G (2014), "Modeling Review Argumentation for Robust Sentiment Analysis", In Proceedings of the 25th International Conference on Computational Linguistics. Dublin, Ireland, August, 2014, pp. 553-564. Dublin City University and Association for Computational Linguistics.
Abstract: Most text classification approaches model texts at the lexical and syntactic level only, lacking domain robustness and explainability. In tasks like sentiment analysis, such approaches can result in limited effectiveness if the texts consist of a series of arguments. In this paper, we claim that even a shallow model of the argumentation of the texts allows for an effective and more robust classification, while providing intuitive explanations of the classification results. Here, we apply this idea to the statistical prediction of sentiment scores for reviews. We combine existing ideas from sentiment analysis with novel features that compare the overall argumentation structure of a review text to a learned set of common sentiment flow patterns. Our evaluation in two domains demonstrates the benefit of modeling argumentation and its abstract structure for text classification in terms of effectiveness and domain robustness.
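A minimal sketch of such flow features in Python (the prototypes and the similarity measure are invented, not the pattern set learned in the paper): a review's sequence of local sentiment values is normalized to a fixed length and compared with common flow prototypes:

def normalize_flow(polarities, length=6):
    """Stretch or compress a list of segment polarities (-1, 0, 1) into
    a fixed-length flow by sampling evenly spaced positions."""
    return [polarities[int(i * len(polarities) / length)]
            for i in range(length)]

PROTOTYPES = {                      # invented common flow patterns
    "steady_positive": [1, 1, 1, 1, 1, 1],
    "negative_turn":   [1, 1, 0, -1, -1, -1],
    "sandwich":        [1, -1, -1, -1, -1, 1],
}

def flow_features(polarities):
    """Similarity of a review's flow to each prototype; usable as
    features next to standard lexicon-based sentiment counts."""
    flow = normalize_flow(polarities)
    return {name: sum(a == b for a, b in zip(flow, proto)) / len(proto)
            for name, proto in PROTOTYPES.items()}

print(flow_features([1, 1, 0, -1, -1]))  # highest score: "negative_turn"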
BibTeX:
@inproceedings{wachsmuth:2014b,
  author = {Henning Wachsmuth and Martin Trenkmann and Benno Stein and Gregor Engels},
  title = {Modeling Review Argumentation for Robust Sentiment Analysis},
  booktitle = {Proceedings of the 25th International Conference on Computational Linguistics},
  publisher = {Dublin City University and Association for Computational Linguistics},
  year = {2014},
  pages = {553-564},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Modeling%20Review%20Argumentation%20for%20Robust%20Sentiment%20Analysis.pdf}
}
Wachsmuth H, Prettenhofer P and Stein B (2010), "Efficient Statement Identification for Automatic Market Forecasting", In Proceedings of the 23rd International Conference on Computational Linguistics. Beijing, China, pp. 1128-1136. Coling 2010 Organizing Committee.
Abstract: Strategic business decision making involves the analysis of market forecasts. Today, the identification and aggregation of relevant market statements is done by human experts, often by analyzing documents from the World Wide Web. We present an efficient information extraction chain to automate this complex natural language processing task and show results for the identification part. Based on time and money extraction, we identify sentences that represent statements on revenue using support vector classification. We provide a corpus with German online news articles, in which more than 2,000 such sentences were annotated by domain experts from the industry. On the test data, our identification algorithm achieves overall precision and recall of 0.86 and 0.87, respectively.
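For illustration, the identification step can be approximated with scikit-learn (the tiny training set below is fabricated; the paper additionally builds on extracted time and money entities and a large expert-annotated corpus):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

sentences = [
    "Revenue is expected to reach 2 billion EUR by 2012.",
    "Sales rose 12 percent to 480 million USD last quarter.",
    "The company is headquartered in Berlin.",
    "The new CEO joined in March.",
]
is_revenue_statement = [1, 1, 0, 0]

# Support vector classification over word and bigram features.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, is_revenue_statement)
print(clf.predict(["Analysts forecast revenue of 3 billion EUR for 2013."]))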
BibTeX:
@inproceedings{wachsmuth2010a,
  author = {Henning Wachsmuth and Peter Prettenhofer and Benno Stein},
  title = {Efficient Statement Identification for Automatic Market Forecasting},
  booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics},
  publisher = {Coling 2010 Organizing Committee},
  year = {2010},
  pages = {1128-1136}
}
Wachsmuth H, Stein B and Engels G (2011), "Constructing Efficient Information Extraction Pipelines", In Proceedings of the 20th ACM Conference on Information and Knowledge Management. Glasgow, Scotland, October, 2011, pp. 2237-2240. ACM.
Abstract: Information Extraction (IE) pipelines analyze text through several stages. The pipeline's algorithms determine both its effectiveness and its run-time efficiency. In real-world tasks, however, IE pipelines often fail to meet acceptable run-times because they analyze too much task-irrelevant text. This raises two interesting questions: 1) How much "efficiency potential" depends on the scheduling of a pipeline's algorithms? 2) Is it possible to devise a reliable method to construct efficient IE pipelines?
Both questions are addressed in this paper. In particular, we show how to optimize the run-time efficiency of IE pipelines under a given set of algorithms. We evaluate pipelines for three algorithm sets on an industrially relevant task: the extraction of market forecasts from news articles. Using a system-independent measure, we demonstrate that efficiency gains of up to one order of magnitude are possible without compromising a pipeline's original effectiveness.
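A back-of-the-envelope calculation in Python shows where such gains come from (the numbers are invented): scheduling a cheap, selective filter first means the expensive algorithm only sees the fraction of text that passed it.

filter_ms, filter_sel = 1.0, 0.05    # 1 ms/sentence, keeps 5% of sentences
parser_ms = 200.0                    # 200 ms/sentence for deep analysis

cost_parser_first = parser_ms + 1.0 * filter_ms          # parse everything, then filter
cost_filter_first = filter_ms + filter_sel * parser_ms   # parse only the kept 5%
print(round(cost_parser_first / cost_filter_first, 1))   # 18.3, i.e., one order of magnitude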
BibTeX:
@inproceedings{wachsmuth2011a,
  author = {Henning Wachsmuth and Benno Stein and Gregor Engels},
  title = {Constructing Efficient Information Extraction Pipelines},
  booktitle = {Proceedings of the 20th ACM Conference on Information and Knowledge Management},
  publisher = {ACM},
  year = {2011},
  pages = {2237-2240},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Constructing%20Efficient%20Information%20Extraction%20Pipelines.pdf}
}
Wachsmuth H and Bujna K (2011), "Back to the Roots of Genres: Text Classification by Language Function", In Proceedings of the 5th International Joint Conference on Natural Language Processing. Chiang Mai, Thailand, pp. 632-640. AFNLP.
Abstract: The term "genre" covers different aspects of both texts and documents, and it has led to many classification schemes. This makes different approaches to genre identification incomparable and the task itself unclear. We introduce the linguistically motivated text classification task language function analysis, LFA, which focuses on one well-defined aspect of genres. The aim of LFA is to determine whether a text is predominantly expressive, appellative, or informative. LFA can be used in search and mining applications to efficiently filter documents of interest. Our approach to LFA relies on fast machine learning classifiers with features from different research areas. We evaluate this approach on a new corpus with 4,806 product texts from two domains. Within one domain, we correctly classify up to 82% of the texts, but differences in feature distribution limit accuracy on out-of-domain data.
BibTeX:
@inproceedings{wachsmuth2011b,
  author = {Henning Wachsmuth and Kathrin Bujna},
  title = {Back to the Roots of Genres: Text Classification by Language Function},
  booktitle = {Proceedings of the 5th International Joint Conference on Natural Language Processing},
  publisher = {AFNLP},
  year = {2011},
  pages = {632-640},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Back%20to%20the%20Roots%20of%20Genres.pdf}
}
Wachsmuth H, Trenkmann M, Stein B, Engels G and Palakarska T (2014), "A Review Corpus for Argumentation Analysis", In Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing. Kathmandu, Nepal, April, 2014. Vol. 8404(2), pp. 115-127. Springer.
Abstract: The analysis of user reviews has become critical in research and industry, as user reviews increasingly impact the reputation of products and services. Many review texts comprise an involved argumentation with facts and opinions on different product features or aspects. Therefore, classifying sentiment polarity does not suffice to capture a review's impact. We claim that an argumentation analysis is needed, including opinion summarization, sentiment score prediction, and others. Since existing language resources to drive such research are missing, we have designed the ArguAna TripAdvisor corpus, which compiles 2,100 manually annotated hotel reviews balanced with respect to the reviews' sentiment scores. Each review text is segmented into facts, positive, and negative opinions, while all hotel aspects and amenities are marked. In this paper, we present the design and a first study of the corpus. We reveal patterns of local sentiment that correlate with sentiment scores, thereby defining a promising starting point for an effective argumentation analysis.
BibTeX:
@inproceedings{wachsmuth2014a,
  author = {Henning Wachsmuth and Martin Trenkmann and Benno Stein and Gregor Engels and Tsvetomira Palakarska},
  editor = {Alexander Gelbukh},
  title = {A Review Corpus for Argumentation Analysis},
  booktitle = {Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing},
  publisher = {Springer},
  year = {2014},
  volume = {8404},
  number = {2},
  pages = {115-127},
  note = {ISBN: 978-3-642-54902-1},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/A%20Review%20Corpus%20for%20Argumentation%20Analysis.pdf},
  doi = {10.1007/978-3-642-54903-8_10}
}
Wachsmuth H (2015), "Pipelines for Ad-hoc Large-scale Text Mining". PhD thesis, University of Paderborn. To be published in LNCS.
Abstract: Today's web search and big data analytics applications aim to address information needs (typically given in the form of search queries) ad-hoc on large numbers of texts. In order to directly return relevant information instead of only returning potentially relevant texts, these applications have begun to employ text mining. The term text mining covers tasks that deal with the inference of structured high-quality information from collections and streams of unstructured input texts. Text mining requires task-specific text analysis processes that may consist of several interdependent steps. These processes are realized with sequences of algorithms from information extraction, text classification, and natural language processing. However, the use of such text analysis pipelines is still restricted to addressing a few predefined information needs. We argue that the reasons for this are threefold:

First, text analysis pipelines are usually constructed manually for the given information need and input texts, because their design requires expert knowledge about the algorithms to be employed. When information needs that are unknown beforehand have to be addressed, text mining hence cannot be performed ad-hoc. Second, text analysis pipelines tend to be inefficient in terms of run-time, because their execution often includes analyzing texts with computationally expensive algorithms. When information needs have to be addressed ad-hoc, text mining hence cannot be performed on a large scale. And third, text analysis pipelines tend not to achieve high effectiveness robustly on all texts, because their results are often inferred by algorithms that rely on domain-dependent features of texts. Hence, text mining currently cannot guarantee the inference of high-quality information.

In this thesis, we address the question of how to fulfill information needs with text mining ad-hoc in an efficient and domain-robust manner. We observe that knowledge about a text analysis process, as well as information obtained within the process, helps to improve the design, the execution, and the results of the pipeline that realizes the process. To this end, we apply different techniques from classical and statistical artificial intelligence. In particular, we first develop knowledge-based approaches for ad-hoc pipeline construction and for an optimal execution of a pipeline on its input. Then, we show theoretically and practically how to optimize and adapt the schedule of the algorithms in a pipeline based on information in the analyzed input texts in order to maximize execution efficiency. Finally, we statistically learn patterns in the argumentation structures of texts that remain strongly invariant across domains and thereby allow for more robust analysis results in a restricted set of tasks.

We formally analyze all developed approaches and implement them as open-source software applications. Based on these applications, we evaluate the approaches on established and newly created collections of texts for scientifically and industrially important text analysis tasks, such as financial event extraction and fine-grained sentiment analysis. Our findings show that text analysis pipelines can be designed automatically such that they process only those portions of text that are relevant to the information need at hand. Through scheduling, the run-time efficiency of pipelines can be improved by more than one order of magnitude while maintaining effectiveness. Moreover, we provide evidence that a pipeline's domain robustness substantially benefits from focusing on argumentation structure in tasks like sentiment analysis. We conclude that our approaches provide essential building blocks for enabling ad-hoc large-scale text mining in web search and big data analytics applications.

BibTeX:
@phdthesis{wachsmuth2015a,
  author = {Henning Wachsmuth},
  title = {Pipelines for Ad-hoc Large-scale Text Mining},
  school = {University of Paderborn},
  note = {To be published in LNCS},
  year = {2015},
  url = {https://groups.uni-paderborn.de/fg-engels/Forschung/IIP/PDFs/Pipelines%20for%20Ad-hoc%20Large-scale%20Text%20Mining.pdf}
}