Calibrate the perfect NLP Engine for the Anonymisation of Swiss Court Decisions
Hieronymus is the proud co-organiser of a hackathon on the automated anonymisation of Swiss court decisions. Together with our partners, Tilde and Pangeanic, we are taking part in the 5th SwissText & 16th KONVENS Joint Conference 2020.
The topic of anonymisation is highly relevant and much discussed:
- In the era of GDPR and the redrafting of the Swiss Data Protection Act, companies must be provided with a tool that allows them to comply with new regulations in a cost-efficient and reliable manner.
- Anonymisation is key to progress in machine learning technology: without (anonymised) data, no engine can be trained.
- Under the principle of transparency, public administrations and courts are required to inform the public on their activities and publish their decisions, whilst protecting the parties’ privacy.
Data anonymisation thus typically aims to protect individuals’ privacy while maintaining the integrity of the data gathered and shared. In technical terms, personally identifiable information can be de-identified, deleted, obfuscated, pseudonymised or encrypted. The level of anonymisation required may vary depending on the use case.
Deep Learning technology, combined with regular expressions and dictionaries, allows for a cost-efficient “translation” from “non-anonymised” or “fully identifiable” to “anonymised” or “de-identified” texts. As with any machine translation tool, engine specialisation is key to performance and precision.
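To make the rule-based part of this combination concrete, here is a minimal sketch of a dictionary and regular-expression pass. It does not include the Deep Learning component and is not the organisers' engine. The name list, the patterns and the placeholder labels (`[NAME]`, `[EMAIL]`, `[DATE]`) are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical dictionary of known party names (an assumption for this sketch;
# a real system would rely on much larger resources and a trained model).
KNOWN_NAMES = {"Hans Muster", "Anna Beispiel"}

# Illustrative patterns only; they are far from exhaustive.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
}


def deidentify(text: str) -> str:
    """Replace dictionary hits and pattern matches with placeholder labels."""
    # Dictionary pass: longest names first, so a shorter entry never
    # clobbers part of a longer one.
    for name in sorted(KNOWN_NAMES, key=len, reverse=True):
        text = re.sub(re.escape(name), "[NAME]", text)
    # Regular-expression pass for structured identifiers.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Hans Muster (hans.muster@example.ch) wurde am 12.03.2019 verurteilt."
print(deidentify(sample))
```

Rules of this kind are cheap and predictable, but they only catch what has been listed or patterned in advance, which is why they would typically complement a trained model rather than replace it.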
We would like to explore the potential of a dedicated de-identification/anonymisation tool with the Hackathon participants, letting them discover the many challenges of effective anonymisation and come up with the best possible solutions. We would then share our experience and own approach with them, both as neural machine translation experts (Pangeanic and Tilde) and lawyer-linguists (Hieronymus).
The objectives of this Hackathon are therefore:
1. To de-identify rather than to fully anonymise the datasets, stripping out names and obvious identifiers.
2. To understand, and potentially suggest solutions for flagging, information that can easily allow a reader to “re-identify” the individuals or parties concerned.
On the topic of re-identification, we recommend reading this article about interesting research conducted by the University of Zurich.
The Hackathon will be divided into two challenges:
- Marathon: Participants will have 1 month to train a baseline anonymisation engine based on the (constrained) data made available to them – their goal is to produce a better baseline engine for the German language than the one provided by the organizers.
- Sprint: Participants will be given specific texts to de-identify/anonymise (Zurich Court Decisions) and will have 3.5 hours to improve the results by calibrating either their own baseline engine (if they took part in the Marathon) or the baseline engine of Pangeanic/Tilde, adding pre- and post-processing rules.
Participants can use any dictionary or library of their choosing. The participants will get a short introduction (also available in written form) to the specificities of Court Decision Anonymisation and the expectations of Swiss courts.
The results of the Sprint will be assessed with automatic metrics (Precision, Recall and F1-score). The 3 best results will then be subjected to human evaluation by lawyer-linguists and a representative of a Zurich Court.
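As a reminder of what these metrics measure, the sketch below computes Precision, Recall and F1-score over sets of annotated entities. It assumes exact matching on hypothetical `(start, end, label)` spans. The matching scheme actually used for the evaluation (exact span, partial overlap or token level) is not specified here.

```python
from typing import Set, Tuple

# An entity is a (start offset, end offset, label) triple.
# This representation is an assumption made for this sketch.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Score predicted entities against gold annotations using exact matching."""
    true_positives = len(gold & predicted)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {(0, 11, "NAME"), (13, 35, "EMAIL"), (46, 56, "DATE")}
predicted = {(0, 11, "NAME"), (46, 56, "DATE"), (60, 70, "NAME")}
print(precision_recall_f1(gold, predicted))
```

For anonymisation, Recall is usually the more critical figure, since every missed identifier is a potential privacy leak. Low Precision mainly costs readability, because harmless text gets masked.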
While the assessment is taking place, the participants will have a coffee/beer break (approx. 1 hour), followed by a wrap-up and prize award.
Marathon: Task Description
The task will start one month before the Sprint challenge. We will provide:
- A baseline NLP framework
- Tagged data in German for training an anonymisation engine. The data will be constrained in order to allow for a fair comparison of the results. Note that participants are allowed to use basic linguistic resources such as taggers, segmenters, dictionaries or libraries of their choosing.
The task consists of improving the baseline. The goal is to share interesting ideas from any area of interest in the scientific community, the public administration and the industry, as well as to improve the state of the art (e.g.: [Paris, 2019, Désidentification de comptes-rendus hospitaliers dans une base de données OMOP, https://imia.limsi.fr/talmed2019/talmed2019-program.html]).
At the end of the Marathon, Participants will write a short report explaining their system and highlighting how their methods differ from the baseline framework, making clear which tools and training sets have been used.
On the day of the Sprint, Marathon Participants will use their own systems to anonymise a test set of unseen sentences. The quality will be measured by automatic evaluation metrics (Precision, Recall and F1-score).
The Hackathon is organized jointly by Pangeanic, Tilde and Hieronymus.
- Pangeanic is a leading language processing company, developing tools that combine NLP and Artificial Intelligence. It creates new solutions for machine translation, e-Discovery, automatic content classification, summarization and audio transcription. Pangeanic has successfully built anonymisation tools for several European languages and Japanese.
- Tilde is a European leader in language technology, developing custom machine translation systems and offering online terminology tools for a wide range of languages. They excel in data selection, collection and cleaning, paving the way for new high quality NLP solutions.
- Hieronymus AG – Translations by Lawyers for Lawyers, is a Swiss legal translation agency actively embracing the technological revolution of NMT: In February 2020, Hieronymus launched the beta version of LEXMachina, the first Neural Translation Engine specializing in Swiss law, developed in collaboration with Tilde. Our team of Senior Lawyer-Linguists and MT Linguists works hand in hand with our MT Partners to provide cutting-edge MT tools tailor-made for the special needs of the Swiss legal industry. Releasing an efficient Anonymisation Tool for Swiss lawyers is our target for 2020.
Pangeanic and Tilde have been awarded EU funds to develop an open-source Multilingual Anonymisation Toolkit for all EU languages, able to detect and de-identify personal data (names, addresses, emails, credit card and bank account numbers, etc.). The tool will enable EU public administrations to comply with GDPR requirements, in particular in the health and legal fields.
- Release of training data and baseline framework: 22 May 2020
- Submission deadline for the Marathon results (short report): 22 June 2020 (10:00 am CET)
- Sprint day: 23 June 2020