The client is a company that processes data collected from the media and finds insights within it. They provide services in the form of various products - web and mobile applications with dashboards and reports - which are used by the client's customers. Behind all of that lie many complex tasks implemented by aging software and outdated services. The primary purpose of our work is to improve these existing products by replacing old microservices and to help other teams by automating their workflows with modern solutions. This is accomplished using multiple Natural Language Processing (NLP) models and AI techniques.
Our goal was to build NLP backend services that improve the performance and accuracy of these solutions. Among others, we identified areas such as:
By making each NLP model part of an independent microservice, we provide an API for every model, which can then be exposed to any of the aforementioned client products. With these APIs integrated into the products, the time needed to generate final results is significantly reduced.
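To make this concrete, here is a minimal sketch of the kind of request/response contract such a microservice might expose. The function name, JSON fields, and placeholder result are all hypothetical illustrations, not the client's actual interface; a real service would invoke an NLP model where the comment indicates.

```python
import json

def handle_request(body: str) -> str:
    """Hypothetical handler: accept a JSON payload with a 'text'
    field and return a JSON result, mimicking the shape of an
    NLP microservice endpoint."""
    payload = json.loads(body)
    text = payload.get("text", "")
    # A real service would run the NLP model here; as a stand-in we
    # return a placeholder label and a simple whitespace token count.
    result = {"label": "NEUTRAL", "tokens": len(text.split())}
    return json.dumps(result)
```

Keeping each model behind its own small interface like this is what lets the services be deployed, scaled, and replaced independently.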
For data wrangling, we used standard Python libraries such as NumPy, Pandas, PyTorch, and NLTK. For NLP models, we used Hugging Face, an open-source provider of NLP technologies. Most of the models were fine-tuned on the client's data using JupyterHub hosted as an on-premises solution. Depending on the task at hand, we used models with complex architectures such as BERT, RoBERTa, and GPT-2. These models needed to be language-agnostic, as they would be used in multiple countries, which was one of the main reasons we implemented the solution manually.
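The text typically goes through a normalization step before it reaches a tokenizer. The sketch below shows one plausible, minimal version of such a cleaning step using only the standard library; the exact rules (e.g. whether to lowercase, which matters for cased models like BERT) are an assumption here, not the client's actual pipeline.

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Illustrative text normalization of the kind often applied
    before tokenization: Unicode NFKC normalization, lowercasing,
    and whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text)  # unify e.g. non-breaking spaces
    text = text.lower()                          # skip this for cased models
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text
```

In practice this step would sit in front of a Hugging Face tokenizer, which then handles subword splitting for the chosen model.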
For a Data Scientist to contribute fully to a project, knowledge of Machine Learning alone is not enough. This is why we expanded our knowledge of the following technologies:
At the beginning of our journey on this project, we were already familiar with NLP. However, since most of the materials and documentation were written in German, we started taking German classes. This step proved helpful, as it allowed us to analyze the textual results and the decisions made by the NLP models much more quickly. We actively participated in discussions about the best model to use for each microservice and contributed suggestions of our own - we stay up to date by reading the latest scientific papers and following the newest libraries in the NLP field.
For each NLP model we deployed, we asked other teams in the company for feedback on its performance so we could improve our solutions and meet the client's new requirements. Based on that feedback, we proposed multiple options for improvement, and through iterative testing we converged on the most suitable solutions.
To help other teams understand what happens behind the scenes in the NLP microservices, we built a demo application for all employees of the client's company so they could try out the models and see what the results look like. Furthermore, we stored some of the results in an Elasticsearch cluster to optimize the demo app's execution time. The advantage of an Elasticsearch cluster lies in distributing indexing and search tasks across all nodes in the cluster.
This made it easier to reuse results and avoid re-running the algorithms, which greatly improved response time and user experience.
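The reuse pattern described above is essentially a cache-aside lookup: check the store first, and only run the model when no stored result exists. The sketch below illustrates the idea with an in-memory dict standing in for the Elasticsearch index; a real implementation would use the Elasticsearch client's document get/index operations instead, and the function and ID names are hypothetical.

```python
from typing import Callable, Dict

# In-memory stand-in for the Elasticsearch index used in the demo app.
_cache: Dict[str, str] = {}

def get_or_compute(doc_id: str, compute: Callable[[], str]) -> str:
    """Cache-aside lookup: return a stored NLP result if present,
    otherwise run the (expensive) model once and store its output."""
    if doc_id in _cache:
        return _cache[doc_id]
    result = compute()      # e.g. run the NLP model on the document
    _cache[doc_id] = result # persist so later requests skip the model
    return result
```

Because repeated requests for the same document hit the store instead of the model, the demo app's response time stays low even for expensive analyses.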
We took responsibility for our work, actively participating in the monthly meetings and sprint reviews where we presented our progress and completed tasks. This strengthened our relationship with the client and grew into a successful partnership.
Kristina is a well-organized team player, always interested in making sense of data and ready to contribute to problem-solving.