Richard Flemmings discusses the importance of training data in applying Machine Learning to the classification of Earth Observation imagery
It is a truism, accepted for more than 60 years, that the output of a computing process is only as good as the data that goes into it. Garbage in, garbage out (GIGO) or, in the UK, rubbish in, rubbish out (RIRO) holds that erroneous, out-of-date, inaccurate, or simply unfit-for-purpose data will, when logic is applied or analysis undertaken, produce incorrect or unreliable outputs.
Within the GIS sector this has never been more relevant, for data is being collected at an unprecedented rate: country-wide aerial surveys are gathering imagery at 12.5 cm resolution; national mapping programmes are capturing 500 million geospatial features; and some 700 Earth Observation (EO) satellites are imaging the globe’s entire landmass every day.
This wealth and volume of data - and potentially valuable information - sits alongside staggering improvements in computer performance that allow data to be automatically queried or classified. Artificial Intelligence (AI) and its subset Machine Learning (ML) are also making significant inroads within our sector and are being hailed as a possible solution to potential data overload.
AI is the overarching concept of intelligent machines that can simulate human thinking capabilities and behaviour, while ML is a subset of AI that sees machines learning from data and/or experience without being explicitly programmed. Most importantly, especially within the field of GIS, ML is about optimally solving a specific problem. ML can also be used to achieve new levels of efficiency, productivity and, potentially, accuracy, although it should come with a warning!
Making sense of data
There are a number of subdivisions of ML, including supervised versus unsupervised learning, and a variety of classification or learning algorithms. In simple terms, a supervised learning model uses algorithms that learn from a labelled training data set before being evaluated against a separate, independent test data set. Unsupervised models, in contrast, do not require training data - the model tries to make sense of the data by extracting features and patterns on its own.
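As a rough illustration, the two approaches might be set up as follows using Python's open-source scikit-learn library (a minimal sketch only - the pixel values and class labels below are random placeholders rather than real imagery):

    # Supervised vs unsupervised classification of image pixels (illustrative only).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.cluster import KMeans

    pixels = np.random.rand(1000, 4)          # one row per pixel, e.g. four spectral bands
    labels = np.random.randint(0, 3, 1000)    # analyst-supplied classes, e.g. water / urban / vegetation

    # Supervised: the model learns from the labelled examples it is shown.
    supervised = RandomForestClassifier().fit(pixels, labels)
    predicted_classes = supervised.predict(pixels)

    # Unsupervised: the model groups pixels purely by similarity; naming the
    # resulting clusters is left to the analyst afterwards.
    unsupervised = KMeans(n_clusters=3, n_init=10).fit(pixels)
    cluster_ids = unsupervised.labels_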
For the purpose of this discussion we will consider supervised ML, i.e. a learned response based on examples of the correct answer. The quality and integrity of these examples, known as training data, are therefore critical if we are to have confidence in the process outputs. Let’s not forget rubbish in = rubbish out.
As already mentioned, there is a vast resource of EO data already available … one that is being continuously expanded and updated. However, this resource is only of value if information can be derived from the raw data. So how do we ensure that we get the best results possible using ML? Again, assuming we are using supervised ML, we need to show the computer what to do, and we do that by providing examples of the task executed correctly.
Decisions, decisions
One supervised learning methodology for classification, regression and other tasks is known as Random Forest (RF). This operates by constructing a multitude of decision trees during the training process and generating an output based on the mode of the classes predicted by the individual trees for classification, or the mean of their outputs for regression.
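In practice, a Random Forest classifier might be applied to labelled pixels along the following lines, again using scikit-learn as one open-source implementation (a sketch only - the feature and label arrays are illustrative placeholders rather than real EO data):

    # Training a Random Forest on labelled pixels and classifying new ones.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.random.rand(500, 6)          # training pixels: e.g. spectral bands plus derived indices
    y_train = np.random.randint(0, 4, 500)    # their labelled land-cover classes

    # Many decision trees are built on bootstrapped samples of the training data;
    # for classification their votes are aggregated (scikit-learn averages the
    # trees' class probabilities) to give the modal class for each pixel.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)

    X_new = np.random.rand(10, 6)             # unlabelled pixels to classify
    print(forest.predict(X_new))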
Having used RF to create a variety of data layers, recently launched as the Country Intelligence suite, 4 Earth Intelligence (4EI) has gained considerable experience in the use of ML to process EO data.
Although the RF method of Machine Learning has been used to good effect, we still need to be mindful of the learning process and the training data employed. So what is the optimal training data size? And what constitutes a sufficient amount of training data?
As you may expect, there is no easy answer to these questions. The selection of training data – its source, content, coverage, currency, detail, accuracy etc. – must be matched directly to the outputs required. For example, for a small area requiring a more detailed classification, it may be possible to create training data from field surveys and higher-resolution source imagery. However, this would be impracticable and potentially costly for, say, regional or global classification studies requiring continuous updates. The amount of training data will also affect classification accuracy, with the optimum size usually related to the choice of classifier and the dimensionality of the input data.
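One simple way to explore that question empirically is to measure classification accuracy against increasing amounts of training data, for example with scikit-learn's learning_curve utility (a sketch only; the arrays are random placeholders standing in for pixel features and labels):

    # Gauging how much training data is enough via a learning curve.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    X = np.random.rand(5000, 8)            # e.g. spectral bands and derived indices
    y = np.random.randint(0, 4, 5000)      # e.g. four land-cover classes

    sizes, train_scores, test_scores = learning_curve(
        RandomForestClassifier(), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    # Accuracy typically rises and then levels off; the knee of the curve is a
    # reasonable working estimate of a sufficient training-set size.
    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(f"{int(n):5d} training samples -> {score:.2f} cross-validated accuracy")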
The training data used must also be capable of producing the same outputs that are required of the finished product, otherwise mislabelling or classification problems can arise from mixed pixels. Multiple studies have suggested that a proportional approach - sampling training data in line with the relative abundance of each class - can achieve higher overall accuracy, but it is important to know the proportions of the map classes prior to making the map.
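If those proportions are known, or can reasonably be assumed, the idea can be expressed as a stratified draw of training samples - again a minimal sketch in scikit-learn, with the class proportions here assumed purely for illustration:

    # Proportional (stratified) selection of training samples.
    import numpy as np
    from sklearn.model_selection import train_test_split

    candidate_pixels = np.random.rand(10000, 4)
    candidate_labels = np.random.choice([0, 1, 2], size=10000, p=[0.6, 0.3, 0.1])

    # Stratifying on the labels keeps the assumed class proportions in the
    # training subset, so common and rare classes appear roughly as they do
    # on the ground.
    X_train, _, y_train, _ = train_test_split(
        candidate_pixels, candidate_labels,
        train_size=2000, stratify=candidate_labels, random_state=0)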
The selection of training data for analysis should reflect the area that is being studied and the currency of that data; after all, it’s important to compare like with like, so there is no point applying training data from an urbanised area to one that is sparsely populated, or in using data from a similar region that is out-of-date.
Practice makes perfect
One advantage of the RF methodology for satellite image classification is that it is relatively insensitive to noise and overtraining, as the output is based on the averaged outcome of many trees rather than a single weighted value. Random Forest is also computationally much lighter than some other methods and there are multiple open-source implementations that can be utilised.
In addition, this workflow, after initial training, does not require a great deal of parameter adjustment and fine tuning, with the default parameterisation often giving excellent performance. Nevertheless, as the outputs of 4EI’s Country Intelligence GIS suite can vary enormously depending on the study region, we continuously refine the selection of training data used.
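As a small illustration of how little tuning may be needed, a forest trained with default parameters can report its own out-of-bag accuracy estimate (a sketch only; the arrays are placeholders for labelled training pixels):

    # Checking a default-parameter Random Forest with its out-of-bag score.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.rand(2000, 6)
    y = np.random.randint(0, 4, 2000)

    forest = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)
    print(f"Out-of-bag accuracy with default parameters: {forest.oob_score_:.2f}")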
So, in conclusion, it is really no surprise that the more considered the selection of training data, the better the outputs of Machine Learning will be. Training data should be relevant, current, accurate, complete, and all those other positive attributes we commonly associate with geographic data and analysis. However, it is also important that consideration be given to the desired outputs of the workflow. As this article started with an adage, it seems fitting to conclude with a few more that seem particularly apt: ‘you can’t make a silk purse out of a sow’s ear’, ‘practice makes perfect’ and, sometimes, there is no need to ‘use a sledgehammer to crack a nut’.
Richard Flemmings is Partner and CTO of 4 Earth Intelligence (https://www.4earthintelligence.com) based in Bristol and with offices in Abu Dhabi, UAE. He holds an MSc in Geographical Information Science and has been involved in the geospatial industry for more than 18 years, gaining world-wide experience in delivering innovative and technically demanding projects. Richard is a Fellow of the Royal Geographical Society (FRGS) and a Chartered Geographer (CGeog GIS).