There Can Be No True Scottish Speech Recognition System

Maggie Reid1 and Dr. Gregor MacGregor2

1 Department of Interracial Computer Science, Cranberry-Lemon University, Pittsburgh, PA, USA

2 Department of Linguistics, William Wallace University, Edinburgh, Scotland

Abstract

With machine learning and neural networks advancing the field of natural language processing, new applications interfacing man and machine through Speech Recognition Systems (SRS) are being developed every day across the world. Studies have shown that modern voice recognition technology not only struggles to understand the Scottish accent, but consistently fails to produce one understandable to the local public of Scotland. A novel neural network has been developed to create an interface between machine and Scot. Additional nodes and Percussive layers were added to the network, allowing the language processor to extract information from mumbled utterances normally non-differentiable to most human ears but extremely meaningful to Scotsmen. While some Scotsmen were able to interface with the network when it used regionally specific training data, it was determined that regions differed from pub to pub. Further analysis revealed that the Scottish dialect outside of any one town is non-convergent, and that it is mathematically impossible for man or machine to create a true Scotsman.

Keywords: Speech Recognition Systems, Natural Language Processing, Scottish Accent, Voice Recognition

1. Introduction

You don’t have to sing ‘Auld Lang Syne’ on New Year’s to know that the Scottish accent and language are among the most unique and diverse ways of speaking on this planet. Despite the many advances in Natural Language Processing (NLP) technology, countless challenges have prevented the development of a commercially viable Scottish accent or language model, let alone an SRS. The dialect is filled with colorful phrases, irreplaceable expressions, and an unpredictably vibrant musicality unmatched to this day. For non-Scottish English speakers, it is easy to understand the heat of the moment when conversing with a Scot even though the actual words are impossible to decipher. It is a human accent easily translated to rhyme but not reason. MacDonald and Brown attempted to bridge this gap using back-and-forth interactions to find clarity [1]. Those iterations of Scottish Speech Recognition Systems failed by requiring a Scot to speak like a machine. This study showed that the key was to train a machine to speak and listen like a human.

1.1 Speech Recognition System Challenges

SRSs face many challenges, including local dialects, spontaneous speech, unconventional grammatical structures, mumbled words, countless phrases that mean different things, a rapidly changing pace of speech, and accented words that signal different meanings and connotations. The Scottish accent suffers from all of these issues. An even more challenging issue was discovered in testing: after multiple iterations of speaking with an SRS, many Scottish speakers began, in frustration, to deliberately misuse normal English phrases, which made even the most well-tuned SRS struggle to converge on a solution. Adding further complications, the microphone’s DSP began clipping as the human subjects raised their voices. These challenges and more were mitigated with novel additions to a traditional artificial neural network architecture commonly used in most NLP techniques.

2. Natural Language Processing Architecture

Four additional layers were added to the NLP network to successfully interpret the Scottish accent, as shown in Figure 1. First, to prevent the network from being bogged down, unnecessary cursing was removed by an Expletive filter. Second, the remaining sounds and meaning were broken down into n-factorial or fewer combinations of tokens by the Percussive layer. Third, the NLP adapted to the speaker using a region-based Multimodal learning layer. Finally, if all of that fails and the speaker is detected to be growing frustrated, the algorithm adjusts mid-interaction by employing an Emotional Backwards Propagation technique, which optimizes the parameters of the first three additional layers.

Figure 1: Scottish Accent Speech Recognition System Neural Network


2.1 Expletive Filter

A 2018 study by Smith and Adams determined that, in 2007, 25% of all words spoken by an adult Scot were expletives; by 2017, that figure had risen to an alarming 31% [2]. While including the cursing in the network would not make processing impossible, initial experiments determined that it increased processing time by 55% in nominal cases and by 255% in cases with malfunctioning Emotional Backwards Propagation. The Expletive filter first detected the unnecessary words and then assigned each a numerical value estimated from its volume and its deviation from the speaker’s normal set of curse words. That numerical value would then trigger different accented phrases processed in the later layers of the NLP network.
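To make the scoring step concrete, a minimal sketch is given below. The expletive lexicon, the volume normalization, and the speaker baseline are all hypothetical placeholders; the deployed filter tuned these per speaker.

```python
# Minimal sketch of the expletive scoring step (illustrative only).
# The word list, volume weighting, and speaker baseline are hypothetical.

KNOWN_EXPLETIVES = {"bampot", "numpty", "eejit"}  # placeholder lexicon

def expletive_score(word: str, volume_db: float, speaker_baseline: set) -> float:
    """Score a token: 0.0 if clean, higher if loud and unfamiliar."""
    if word.lower() not in KNOWN_EXPLETIVES:
        return 0.0
    novelty = 0.0 if word.lower() in speaker_baseline else 1.0  # deviation from the speaker's usual set
    loudness = max(0.0, volume_db - 60.0) / 40.0                # crude normalization to roughly [0, 1]
    return loudness + novelty

if __name__ == "__main__":
    baseline = {"numpty"}
    print(expletive_score("bampot", 95.0, baseline))  # loud and novel -> 1.875
    print(expletive_score("numpty", 70.0, baseline))  # familiar -> 0.25
```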

2.2 Percussive Layer

The most difficult step in translating the Scottish language within the network was tokenization. Determining tokens is a challenge in every language, but it wasn’t until the Percussive-to-Token activation function developed by Reid and MacGregor [3] that the ever-fluid rhythm of the Scottish accent could be modeled and interpreted into tokens usable by a typical neural network. The Percussive layer worked in two steps. First, buffered DFTs detected impulses normalized to the speed of speech. Once the impulses were separated into small bites of text through the mumble-to-character transform, syllables could be formed. The second, trickier step turned those syllables into tokenized words. As described in detail in [3], a series of Bayesian forking trees created factor graphs that produced a score determining the optimum grouping of the syllable bytes. For initial training, alternative groupings were kept in a buffer to find the highest-scoring output past the remaining layers.
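A rough sketch of the first step, buffered DFTs flagging impulses, is shown below. The frame size, threshold, and pace normalization are illustrative assumptions; the full mumble-to-character transform and the Bayesian forking trees are detailed in [3].

```python
# Sketch of the first Percussive-layer step: buffered DFTs flag spectral
# energy impulses, normalized to the speaker's overall pace. Frame size
# and threshold factor are hypothetical.
import numpy as np

def detect_impulses(signal: np.ndarray, frame: int = 512, k: float = 2.0) -> list:
    """Return frame indices whose spectral energy spikes above k times the mean."""
    n_frames = len(signal) // frame
    energies = []
    for i in range(n_frames):
        spectrum = np.fft.rfft(signal[i * frame:(i + 1) * frame])
        energies.append(np.sum(np.abs(spectrum) ** 2))
    energies = np.array(energies)
    pace = energies.mean() + 1e-9  # crude normalization to speed of speech
    return [i for i, e in enumerate(energies) if e > k * pace]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(0, 0.1, 512 * 40)
    audio[512 * 10:512 * 11] += np.sin(np.linspace(0, 60 * np.pi, 512))  # a "mumble impulse"
    print(detect_impulses(audio))  # expect a spike at frame 10
```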

2.3 Multimodal Learning

Discoveries from the data collection phase described in this paper showed that the only viable path for the NLP was a multi-modal approach. Due to the regionality and personality of the accent, different modes were activated using the nearest pub’s wifi. Once the nearest pub was determined, the optimal network was selected. In the countryside, away from town centers, the algorithm defaulted to a learning mode. Though limited, learning occurs much more efficiently in the calm green rolling hills of the highlands!
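A minimal sketch of the mode-selection logic follows. The pub SSIDs and model names are invented for illustration; the deployed system matched against a surveyed list of establishments.

```python
# Sketch of pub-based mode selection. SSIDs and model names are
# hypothetical stand-ins for the surveyed pub list.
PUB_MODELS = {
    "MacPoyles_Guest": "macpoyles_v3",
    "HorseshoeBar_WiFi": "horseshoe_v1",
}

def select_mode(visible_ssids: list) -> str:
    """Pick a pub-specific network if a known pub SSID is in range, else default to learning mode."""
    for ssid in visible_ssids:
        if ssid in PUB_MODELS:
            return PUB_MODELS[ssid]
    return "countryside_learning_mode"  # default, in the calm of the highlands

print(select_mode(["HorseshoeBar_WiFi", "BTHub-1234"]))  # -> horseshoe_v1
print(select_mode(["BTHub-1234"]))                       # -> countryside_learning_mode
```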

2.4 Emotional Backwards Propagation

The Emotional Backwards Propagation technique was a new development for SRSs. Practice had determined that it was completely necessary for correct learning and for a civil, non-violent human-to-machine interface. Information from the Expletive and Percussive layers was normalized against typical user speech to detect heightened emotional states. Given a higher emotive state, the SRS would reply with a “Keep the heid,” which roughly translates to “stay calm, don’t get upset.” As the user continued interfacing with the SRS, the Expletive filter and Percussive layers were adjusted to account for the heightened state until it lowered. At lower emotive states, active-listening phrases were fed back to the user, such as “Aye ye Dinnae See,” typically understood as “You don’t say.” The addition of the backwards propagation technique was the only way for the network to converge to individual users.
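The feedback step might look something like the sketch below. The arousal threshold, adjustment factor, and signal names are hypothetical; only the two reply phrases come from the system described above.

```python
# Sketch of the Emotional Backwards Propagation feedback step.
# Thresholds and the adjustment factor are hypothetical placeholders.
def emotional_feedback(expletive_level: float, impulse_rate: float,
                       baseline_rate: float) -> tuple:
    """Return (reply phrase, layer adjustment) for the current emotive state."""
    arousal = expletive_level + max(0.0, impulse_rate - baseline_rate)
    if arousal > 1.5:  # heightened state: de-escalate and loosen the earlier layers
        return ("Keep the heid", 1.0 + arousal * 0.1)
    return ("Aye ye Dinnae See", 1.0)  # calm state: active listening, no adjustment

reply, adjust = emotional_feedback(expletive_level=1.8, impulse_rate=3.0, baseline_rate=2.0)
print(reply, adjust)  # -> Keep the heid 1.28
```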

3. Data Collection

As with any proficient SRS, an enormous amount of data is required, and the amount of data and accompanying contextual annotations can vary with languages of different complexity. For example, Mackabee et al. developed an efficient and responsive SRS for American English using approximately 1.55 million minutes of annotated conversations recorded in California coffee shops [4]. Unfortunately, due to the regionality of the acquired data, that SRS was only 20% accurate in the South Eastern and North Eastern United States. Similar variability in data requirements was encountered by groups developing SRSs for German, Austrian, French, Sassy French, and Klingon, which required 2.01, 2.05, 1.8, 12.1, and 0.2 million minutes of annotated conversations, respectively, to develop SRS processors with accurate call and response in >90% of tested scenarios.

We initially collected conversational Scottish from four different locations: a coffee shop, a city park, a football match, and a pub. Unfortunately, we encountered several limitations to data acquisition. We selected Mick’s Shack, a local coffee shop in Greenock, and were quickly met with deep suspicion from the patrons, collecting no usable audio. After being called a ‘fookin bampot,’ we were readily escorted from the grounds and had to search elsewhere. We then sought to record audio at Dunbeth Public Park in Coatbridge and initially found some success, having chosen a particularly refreshing afternoon on which many families were picnicking together. Unfortunately, after merely 30 minutes of recording, a large proportion of the participants, by then inebriated as well as frustrated by a recent staggering loss by their own Albion Rovers, broke all of our equipment and chased us from the park after accusing us of being ‘coigreach.’

That’s when we had to get creative. A Freedom of Information Act (FOIA) request was submitted to the UK’s counter-terrorism division for all conversations recorded in various pubs across Scotland. Not only was the audio data vast, but a large proportion of it was annotated by a special intelligence department set up shortly after the Bishops’ Wars to keep an eye on the rebellious North. With 2.8 million minutes of annotated conversations from between 2015 and 2020, we were able to begin training the SRS.

4. Network Training

While the vastness of the data gathered from the FOIA request greatly assisted development, the lack of quality control caused many issues. Many of the audio recordings were corrupted by environmental noise at each pub, making machine learning very slow and challenging. We attempted to manually choose the usable audio logs, but listening through the data to weed out the bad was not only time consuming but difficult to staff. We employed a small army of unpaid interns to verify that the annotations produced by the counter-terrorism experts matched the audio data. Even with a script, the interns were unable to match the odd Scottish spellings to the fast-paced conversations with any usable accuracy. After a month of spin-up time, our most talented American interns could verify at most five minutes of conversation an hour over the course of the work week, and those interns began demanding compensation. We had only nine hours of usable data by the time summer break ended.

The solution to the quality control problem was as simple as it was elegant. A cursory analysis showed that 97% of the unusable data occurred during fierce football matches and happy hours. We assumed all football matches were fierce, scrubbed through social media posts and football schedules, and were able to clear the unusable data out of the audio logs with a simple web-scraping script. Then it was just up to the SRS to process each pub’s worth of data.
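For illustration, a minimal sketch of the scrubbing pass is given below. The fixture windows, happy-hour times, and log format are hypothetical stand-ins for the scraped schedules.

```python
# Sketch of the scrubbing pass: drop any audio log overlapping a match
# window or a happy hour. Schedule data and log format are hypothetical.
from datetime import datetime

MATCH_WINDOWS = [
    (datetime(2018, 3, 10, 15, 0), datetime(2018, 3, 10, 17, 0)),  # example fixture
]
HAPPY_HOURS = [(17, 19)]  # 5-7 pm daily

def is_usable(start: datetime, end: datetime) -> bool:
    """Keep a log only if it avoids all matches and happy hours."""
    for m_start, m_end in MATCH_WINDOWS:
        if start < m_end and end > m_start:
            return False
    return not any(h0 <= start.hour < h1 for h0, h1 in HAPPY_HOURS)

print(is_usable(datetime(2018, 3, 10, 15, 30), datetime(2018, 3, 10, 15, 45)))  # False: mid-match
print(is_usable(datetime(2018, 3, 11, 12, 0), datetime(2018, 3, 11, 12, 30)))   # True
```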

It was at this point in the learning process that we determined a multi-modal learning process was absolutely necessary. Experiments were performed feeding uniformly distributed random samples of audio into the SRS, but the network would never converge, as each audio segment became its own local minimum. That’s when we began training the SRS on individual pubs and towns, finding that we could achieve convergence with a slow enough learning rate and a less diverse data set.
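A toy sketch of the per-pub training schedule appears below. The model (a least-squares fit), the learning rate, and the data are stand-ins; the point is simply that each pub’s data is fit in isolation with a deliberately slow learning rate.

```python
# Sketch of per-pub training: rather than sampling uniformly across all
# audio, fit one pub at a time so no segment becomes a stray local minimum.
import numpy as np

def train_per_pub(pub_datasets: dict, lr: float = 1e-2, epochs: int = 100) -> dict:
    """Fit a separate weight vector per pub via slow gradient descent."""
    models = {}
    for pub, (X, y) in pub_datasets.items():
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
            w -= lr * grad                     # slow learning rate aids convergence
        models[pub] = w
    return models

rng = np.random.default_rng(1)
data = {"MacPoyle's": (rng.normal(size=(50, 3)), rng.normal(size=50))}
print(train_per_pub(data)["MacPoyle's"])
```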

5. Analysis and Results

Once the SRS had stabilized, we collected further data live with real Scots through double-blind Turing tests. Scots were asked to have a phone conversation with either the machine or another person and then asked whether they believed the voice on the other end was a real person and whether it understood what they were saying. This was accomplished using predetermined dialogue for an attempt at repeatability. That desired repeatability became a pipe dream, as the conversations were constantly altered by the Emotional Backwards Propagation.

The double-blind test showed that Scots were able to tell the SRS was a machine, and did not think it knew what they were saying, 85% of the time. This result was discouraging until we also measured a 90% chance of a Scot reporting another Scot as a machine which ‘Didnae ken whit a’m saying.’ While finding the true Scotsman, or Scotsmachine, was proving elusive, the machine was outperforming the human by a slim margin. The large variance in the results, however, showed that this margin was not statistically significant, so the marginal average improvement left the pursuit inconclusive. Starting small, this first successful iteration was developed and verified at MacPoyle’s pub over weeks of iterations and parameter tweaks to the SRS.
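To illustrate why the five-point margin washed out, here is a minimal two-proportion z-test sketch. The sample sizes are hypothetical, since the group sizes are not reported above; with samples this small, an 85% vs. 90% split is far from significant.

```python
# Sketch of the significance check on the Turing-test margin: 85% of
# Scots flagged the SRS as a machine vs. 90% flagging another Scot.
# Sample sizes (n=40 per group) are hypothetical.
from math import sqrt, erf

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Two-sided p-value for the difference between two proportions."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(two_proportion_z(0.85, 40, 0.90, 40))  # ~0.50: nowhere near significant
```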

In search of a viable product, the development team continued to advance the maturity of the SRS at more pubs across the greater metropolitan Edinburgh area. Similar results were recorded at the Horseshoe Bar. Toward the end of our dialect expansion within the Horseshoe Bar, the SRS began taking longer and longer to adapt to each Scot. We began noticing the increase after successfully interfacing with fifty Scots. This drastic increase in learning time grew unmanageable as we expanded development another block, to Marlow’s Tavern.

Figure 2 below shows the stark increase in learning times. The vertical axis shows the training time required to achieve more than 90% accuracy in the SRS. The horizontal axis shows the number of Scotsmen interpretable by the SRS, with marks showing the number of drinking establishments covered by the system. For the first two locations, the convergence time, once stabilized, remained manageable for most Scots. Once we reached a larger coverage of pubs at Marlow’s, and a different clientele, the learning time became unusable for a commercial system. For the sake of science, and with too much pride to give up, we continued until we began looking into computer clusters for remote processing.

Figure 2: Learning Times by Geographic Area


After reapplying for additional neglected piles of flaming grant money in desperate need of spending, we took a step back and configured the SRS for Marlow’s Tavern in the same manner as MacPoyle’s. With our accumulated experience and intuition in developing the algorithm, we were able to achieve the baseline 85% Turing-test failure rate. In due diligence, we made another data-gathering trip to MacPoyle’s and the Horseshoe Bar to recreate the previous results with a reasonable learning time. There we found the same issue as with the first configuration. We couldn’t stop there; we had over a year’s worth of funding left. It appeared that the multi-modal states were saturated by personal accents.

Over the course of a year, we repeated the same results across a combination of two dozen drinking establishments, seven coffee shops, and three public parks across all of Scotland. With over twenty-five samples and a blind devotion to our personal interpretation of the central limit theorem, we concluded that our consistent results were law. According to this new law, it is impossible to develop a consistent framework to interpret the accents of the Scottish language. It is, however, possible to approximate the accent if scoped to fewer than three small communities.

6. Conclusion

Many poets and historians, from ages past to the modern day, have described the Scottish people and their land as a unique and indescribable culture. This paper has proven that they are, without a doubt, indescribable. A consistent and successful SRS for the Scottish accent is possible if scoped correctly, but it will always require many man-hours of developing and tweaking parameters on the SRS neural network itself.

In a last-ditch attempt, we trained our most successful SRS configuration using over one million minutes of annotated audio data and asked it what a true Scotsman was. The network proceeded to answer:

    Ae weet forenicht i’ the yow-trummle
    I saw yon antrin thing,
    A watergaw wi’ its chitterin’ licht
    Ayont the on-ding;
    An’ I thocht o’ the last wild look ye gied
    Afore ye deed!

    There was nae reek i’ the laverock’s hoose
    That nicht–an’ nane i’ mine;
    But I hae thocht o’ that foolish licht
    Ever sin’ syne;
    An’ I think that mebbe at last I ken
    What your look meant then.

We still don’t know what it meant, but we did determine that the SRS had simply recited Hugh MacDiarmid’s ‘The Watergaw.’ The open-source online interpretations of the poem did not help, and we still do not know if there is any true Scotsman.

Acknowledgements

Thanks to all of the unpaid interns, to Mary Brown for the discount beer at MacPoyle’s, and to Limmy McDouglass for the competitive dart games while people talked to the SRS.

References

  1. MacDonald G and Brown F C 2015 An Interactive Approach to Scot-to-Machine Translation, Journal of Highland Research
  2. Smith A and Adams A 2018 A Comparative Study of Cursing in the Modern Scot, Journal of Scottish Culture
  3. Reid M and MacGregor G 2017 Percussive Modeling of the Scottish Accent, Interrace Cybernetics Journal
  4. Mackabee J et al. An Unnecessarily Bayesian Approach to Speech Recognition Systems, Journal of Corporate Silicon Valley


