How can AI help fight a coronavirus-like pandemic in Pakistan?

By: Yasir Dil Nawaz Khan

Imagine a typical working-day morning. You want to park your car in the parking lot of the nearby utility store, but the lot is full, with vehicles parked nose to nose. You wait in a long line, similar to what you would see outside a cricket stadium during Pakistan Super League matches. When you finally get inside, you see rows upon rows of empty shelves. You manoeuvre your shopping cart around frenzied shoppers, only to find that the utility store is out of sugar, flour, and other daily necessities.

The reason for this rush to stock up on basic necessities is not government subsidies. It is the fear and panic as the coronavirus (known as ‘2019-nCoV’ or ‘COVID-19’) spreads globally, and the worry that the government could impose a complete lockdown again in the month of Ramadan.

As of April 22, 2020, there were 11,940 confirmed cases and 253 deaths in Pakistan from the coronavirus outbreak. A death rate of about 2.1% (253 out of 11,940) is alarming, and with a fragile health system we might face serious consequences and the numbers could jump. We know a couple of things about the outbreak in Pakistan. Even though we have the National Incubation Center, which is working to find the best technologies to fight the coronavirus, we might be lacking in some of them, such as state-of-the-art Artificial Intelligence (AI).

We know that the spread is exponential, as recent simulations under different conditions, reported in the media, have shown. So containment is critical. Across the virus ecosystem, we should touch upon seven different areas where AI can be used.

1. Vaccine development, including the infrastructure, clinical trials and commercialization: AI is used to find the right vaccines faster by analyzing prior ones based on similarity measures of protein structures.

2. The infection and spread of the coronavirus: This is an area we’ll focus on to understand how data and AI help us answer some of the critical questions.

3. Diagnosis and treatment at health centers, where machines use AI: some chest X-ray scanning systems can automatically detect signs of the infection using image recognition.

4. Post-treatment, which includes post-care and insurance payments: AI is used for faster payment processing.

5. Then we have the regulators and government agencies that collect, process and make available data across multiple entities.

6. We then have researchers who use that data and other data for creating better drugs, analyzing the impact of medicines and so on.

7. Finally, there are the people, who form the most important part of the healthcare ecosystem. They might have access to information for self-diagnosis, use mobile apps and interact with the ecosystem. One mobile app, for example, helps users check whether they may have the virus by combining some user input with automatically collected data about their location to rate them on a degree of risk. You can imagine a situation where, if a confirmed patient’s mobile location is known at all times, it is possible to identify all the other people this patient came in contact with.
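The contact-identification idea in point 7 can be sketched in a few lines of Python. This is only a rough illustration: the names Ping and find_contacts, the 10-metre and 15-minute thresholds, and the sample coordinates are all assumptions made for this example, and no real app is being described.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Ping:
    """One location reading from a phone (hypothetical structure)."""
    person: str
    minute: int   # minutes since midnight
    lat: float
    lon: float


def distance_m(a: Ping, b: Ping) -> float:
    """Great-circle (haversine) distance between two pings, in metres."""
    earth_radius_m = 6_371_000
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_m * asin(sqrt(h))


def find_contacts(pings, patient, max_m=10.0, max_minutes=15):
    """Return everyone seen close to the patient in both space and time."""
    patient_pings = [p for p in pings if p.person == patient]
    contacts = set()
    for other in pings:
        if other.person == patient:
            continue
        for p in patient_pings:
            close_in_time = abs(other.minute - p.minute) <= max_minutes
            if close_in_time and distance_m(p, other) <= max_m:
                contacts.add(other.person)
                break
    return contacts


# Made-up sample data for illustration only.
pings = [
    Ping("patient", 600, 33.6844, 73.0479),
    Ping("A", 605, 33.68441, 73.04791),  # roughly 1.5 m away, 5 minutes later
    Ping("B", 605, 33.7000, 73.0479),    # about 1.7 km away
    Ping("C", 900, 33.6844, 73.0479),    # same spot, but 5 hours later
]
print(sorted(find_contacts(pings, "patient")))  # ['A']
```

Only person A is flagged, because B was too far away and C arrived long after the patient had left. A real system would also have to deal with location error and privacy, which this sketch ignores.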

Setting up a data hub can be beneficial for analyzing trends and planning future approaches to any pandemic. Data related to COVID-19 can be obtained from the different organizations working under the government. There are two main types of data. One is textual data: the CORD-19 dataset, for example, includes over 24,000 technical articles along with data about adverse effects and the like. The other is numerical data, which covers how the virus spreads and how it is being treated. Data is being added to these data sets each day.

Let’s talk about the textual data first. It is impossible for humans to read through all the literature and extract the critical information. So Natural Language Processing (NLP), a branch of AI, is being applied to this vast data set to extract useful information about the virus. We can use NLP on this literature to understand the protein structure, develop vaccines faster, understand treatment options and targets, predict adverse effects, determine dosage and so on.
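To make the idea of searching a large body of literature concrete, here is a minimal sketch of one classic NLP technique, TF-IDF ranking, which scores each document by how often it uses the query words and how rare those words are overall. The three "abstracts" are invented for this example and the function names are hypothetical; real work on a data set of this size would use a proper NLP library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return indices of matching documents, best TF-IDF score first."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)

    def idf(term):
        # Smoothed inverse document frequency: rarer terms weigh more.
        doc_freq = sum(1 for counts in counts_per_doc if term in counts)
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1

    query_terms = set(tokenize(query))
    scored = []
    for index, counts in enumerate(counts_per_doc):
        total = sum(counts.values()) or 1
        score = sum((counts[t] / total) * idf(t) for t in query_terms)
        scored.append((score, index))
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [index for score, index in ranked if score > 0]


# Invented abstracts, for illustration only.
abstracts = [
    "Spike protein structure of the novel coronavirus",
    "Adverse effects reported after antiviral dosage changes",
    "Hospital capacity planning during outbreaks",
]
print(rank("protein structure", abstracts))
print(rank("adverse effects dosage", abstracts))
```

The first query should surface only the protein-structure abstract and the second only the adverse-effects one, which is the kind of filtering a researcher facing thousands of papers needs.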

One of the latest algorithms for text processing is called BERT, open-sourced by Google. This algorithm overcomes the limitations of prior NLP algorithms by looking at words and sentences from both directions, left to right and right to left, so that it can understand each word in its full context. Sentences are mapped to vectors, or points in a multi-dimensional space; with more context, a vector falls closer to other vectors that convey the same or a similar meaning. For example, the word ‘sick’ in the following sentences is mapped to two different locations because the meaning of the sentences is different.

‘I’m sick, so take me to the hospital’ is different from ‘I’m sick of my boss’.
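The sketch below is not BERT, which needs a large trained model. It is only a toy illustration of the underlying idea that the same word can land at different points depending on its neighbours. The two-dimensional word vectors and the simple averaging rule are assumptions invented for this example.

```python
from math import sqrt

# Hand-made toy vectors: axis 0 loosely means "health", axis 1 "annoyance".
TOY_VECTORS = {
    "sick": (0.5, 0.5),      # ambiguous on its own
    "hospital": (1.0, 0.0),
    "boss": (0.0, 1.0),
}


def contextual_vector(word, sentence_words):
    """Blend a word's own vector with the average of its known neighbours."""
    own = TOY_VECTORS[word]
    neighbours = [TOY_VECTORS[w] for w in sentence_words
                  if w != word and w in TOY_VECTORS]
    if not neighbours:
        return own
    average = tuple(sum(v[i] for v in neighbours) / len(neighbours)
                    for i in range(2))
    return tuple((own[i] + average[i]) / 2 for i in range(2))


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


medical = contextual_vector("sick", "i am sick so take me to the hospital".split())
fed_up = contextual_vector("sick", "i am sick of my boss".split())

print(medical, fed_up)
print(cosine(medical, TOY_VECTORS["hospital"]))
print(cosine(fed_up, TOY_VECTORS["hospital"]))
```

In this toy setup the ‘sick’ of the first sentence ends up much closer to ‘hospital’ than the ‘sick’ of the second, which is the behaviour the two example sentences are meant to show.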

An AI technology company called Data Foundry builds medical language processing applications to answer such questions from text data using NLP. Other forms of textual data include social media data, where people share and talk about the coronavirus: Google searches, Twitter feeds and so on. The unstructured nature of textual data is what makes advanced NLP a great tool for dealing with it.

Now let’s talk about the numerical or semi-numerical data. Semi-numerical means that we have an easy way of converting the values to numbers. For example, gender categories such as male, female and others can be represented by numbers like 1, 2 and 3 respectively. Typically, this data is a little more structured. Data sets from Johns Hopkins, Kaggle, etc. follow a structure, and you can think of them as a simple table. Most tables, though, are rolled up, and data is not reported at the individual patient level. But the more detail we have, the better we can apply machine learning algorithms on top of it. As an example, one of the tables has the following columns rolled up at the county level: the name of the county, its latitude and longitude, the number of identified cases, the number of deaths, and the dates of these occurrences. A more granular data set at the patient level, one of which is available from China, includes additional information such as the patient’s symptoms, other medical conditions, age, gender and so on.
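Both ideas in this paragraph, turning categories into numbers and rolling patient-level rows up to the county level, can be shown with a short Python sketch. The three patient records, the field names and the helper functions are made up for this example; they do not come from any of the data sets mentioned.

```python
from collections import defaultdict

# The 1, 2, 3 coding mentioned in the text.
GENDER_CODES = {"male": 1, "female": 2, "other": 3}

# Invented patient-level records, for illustration only.
patients = [
    {"county": "A", "gender": "male", "age": 61, "died": False},
    {"county": "A", "gender": "female", "age": 74, "died": True},
    {"county": "B", "gender": "other", "age": 35, "died": False},
]


def encode(record):
    """Turn one record into a numeric row: (gender code, age, died as 0/1)."""
    return (GENDER_CODES[record["gender"]], record["age"], int(record["died"]))


def roll_up(records):
    """Aggregate patient-level records into per-county case and death counts."""
    totals = defaultdict(lambda: {"cases": 0, "deaths": 0})
    for record in records:
        totals[record["county"]]["cases"] += 1
        totals[record["county"]]["deaths"] += int(record["died"])
    return dict(totals)


rows = [encode(p) for p in patients]
by_county = roll_up(patients)
print(rows)
print(by_county)
```

Note that the roll-up throws away age and gender, which is exactly why patient-level data is more useful for machine learning. Also, plain integer codes suggest an ordering that does not really exist between categories, so in practice many people prefer one column per category (one-hot encoding).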

In the former case, since we have county information, we can also combine it with demographic information, and perhaps weather information from other sources, to make new inferences. DataRobot, a company in Boston, did just that and concluded that, based on initial data, the virus seems to affect more affluent people first, likely because they can afford to travel more. Now that’s an interesting find. The essence of this table is that each row becomes a point in multidimensional space, and all machine learning algorithms look for patterns in that space.

With this data, we can do two things that machine learning is good at. One is classification and the other is prediction. With classification, we try to find clusters and answer questions about patterns. For example, what are the characteristics of the people who initially got the disease, which locations are more susceptible to the virus, or which age groups die disproportionately? With prediction, we can try to predict the spread of the virus over time, so that we can put stronger mitigation measures in its path, or estimate which health centres will be overwhelmed, so that we can prepare ahead of time with better processing and supplies for them.
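Finding clusters among points in space is commonly done with an algorithm called k-means, and a bare-bones version fits in a few lines of Python. The six patient rows below are invented, and starting from the first k points is a simplification chosen here so the result is repeatable; real implementations pick starting points more carefully and usually rescale the columns first.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=20):
    """Very small k-means: returns (centroids, label for each point)."""
    centroids = list(points[:k])  # simplification: start from the first k points
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, labels


# Invented rows: (age, days from first symptoms to hospital admission).
rows = [(25, 2), (70, 9), (30, 3), (68, 10), (28, 2), (75, 8)]
centroids, labels = kmeans(rows, k=2)
print(labels)
print(centroids)
```

With this made-up data the algorithm separates the younger patients with short delays from the older patients with long delays, without being told that age matters. That is the sense in which each row is a point and the algorithm looks for patterns among the points.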

As with any machine learning algorithm, we find that some factors are more important than others in slowing the spread. For example, we have been asked to maintain social distancing, clean possibly contaminated surfaces like doorknobs with disinfectants, stop gathering in large groups, reduce travelling and so on. Among these, which ones are relatively more effective than others? If we had historical data on a previous outbreak, we might apply a technique called principal component analysis (PCA) to the data to figure out which measures actually matter the most.
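As a rough sketch of what principal component analysis does, the Python below finds the first principal component of a small table using the power-iteration method. The table of regional "adherence scores" is entirely invented, and the function name is hypothetical. Keep in mind that PCA finds the direction in which the data varies the most; the weights it gives each measure are a guide to which measures dominate that variation, not proof of which ones work best.

```python
from math import sqrt


def first_component(rows, iterations=200):
    """Return the unit-length first principal component of the rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalise; the vector converges to the dominant eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Invented scores (0 to 10) per region:
# (social distancing, surface cleaning, travel reduction)
measures = ["distancing", "cleaning", "travel"]
scores = [(1, 5, 4), (3, 6, 4), (5, 4, 5), (7, 5, 4), (9, 5, 5)]

loadings = first_component(scores)
dominant = max(range(len(measures)), key=lambda j: abs(loadings[j]))
print(measures[dominant], loadings)
```

In this made-up table almost all of the variation between regions comes from the distancing score, so it dominates the first component. To judge effectiveness, these components would still have to be related to actual outcomes such as case counts.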

As we build models for classification and prediction, we also need to consider other things. New data is coming in at a much faster rate, which means that the model has to be constantly refreshed. If we find new attributes in the data, then the model has to be rebuilt, or even different types of models have to be trained with the new data. So, in this coronavirus context, a model is only valid for a short duration, because the external factors that affect how the disease spreads and is treated continue to change. For example, with more awareness about social distancing, the original projections and predictions of the model could be wrong. That’s why companies are continuously refining and rebuilding their models every day with new data. Also, models that work in China may not work in Pakistan because of different conditions. Likewise, a high-level country model will not be useful for making local predictions. We will need to go a little deeper into one part of the full spectrum of possibilities.
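Why a model must be refreshed can be seen with a very simple growth model that is refitted on only the most recent days. The sketch below fits a straight line to the logarithm of the case counts, which is one basic way to model exponential growth. The case numbers are invented, and the seven-day window and function names are assumptions made for this example, not anyone's actual forecasting method.

```python
from math import exp, log


def fit_growth(counts):
    """Least-squares fit of log(count) = a + b * day; returns (a, b).

    Counts must be positive, and at least two days are needed.
    """
    n = len(counts)
    days = range(n)
    logs = [log(c) for c in counts]
    mean_day = sum(days) / n
    mean_log = sum(logs) / n
    b = (sum((x - mean_day) * (y - mean_log) for x, y in zip(days, logs))
         / sum((x - mean_day) ** 2 for x in days))
    a = mean_log - b * mean_day
    return a, b


def forecast_next(counts, window=7):
    """Refit on the last `window` days and predict the following day."""
    recent = counts[-window:]
    a, b = fit_growth(recent)
    return exp(a + b * len(recent))


# Invented cumulative cases: about 20% daily growth at first,
# then about 5% after distancing measures take hold.
history = [100, 120, 144, 173, 207, 249, 299,
           314, 330, 346, 363, 381, 400, 420]

early_rate = fit_growth(history[:7])[1]
late_rate = fit_growth(history[-7:])[1]
print(early_rate, late_rate)
print(forecast_next(history))
```

A model fitted only on the first week would keep predicting 20% daily growth and badly overshoot, while the refitted one picks up the slowdown. The same logic applies, on a larger scale, to the models companies rebuild every day.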

We wanted to merge Artificial Intelligence and Data Intelligence together to bring you the whole picture so that we can dive deeper into areas that we are specifically interested in and perhaps contribute to the mitigation of the virus.

The writer is a digital media expert and mobile journalist by profession, and has been building campaigns and communication strategies for the development sector since 2011. He can be reached at www.facebook.com/yasirdil