It’s that ultimate assumption everyone makes: ‘I assume my data is normal, and I will have a complete mental breakdown if it isn’t, because all of my methods assume normality.’ But what do you do if your data isn’t normal? There is nothing I care about more than maintaining valid assumptions and your own mental health in extremely stressful times such as analyzing a strange data set.
Here are 8 things you can do instead of violently destroying your computer and burning all of your project notes in a fit of rage when such a question comes up, all while shouting ‘bUt CeNtRaL lImIt ThEoReM!’
1. Go for a walk and pick a normality test
So you’re getting weird results from all your statistical tests or you’re freaking out because your data looks dangerously and mellifluously Heteroscedastic! The first step is to take a deep breath, step away from your computer and take a walk. Walking relieves stress in so many ways and there is nothing more stressful than dealing with non-normal data. While you’re walking, you should think about picking a normality test.
Normality tests are an easy way of telling if your data is normal. Just because your data might look Heteroscedastic does not mean it’s not normal! Skewness, that’s another story. While you’re enjoying the many benefits of being outside and cooling off, think about the many different tests to consider using. There’s nothing wrong with using multiple tests and picking your favorite result as long as you pick the right one for the dataset size and dimensionality of data. My personal favorite is the Shapiro-Wilk Test for its high sensitivity, ’cause I’m just a sensitive guy.
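For when you get back from the walk, here is a minimal sketch of running the Shapiro-Wilk test with SciPy's `scipy.stats.shapiro`. The two samples, the helper name `looks_normal`, and the 0.05 cutoff are assumptions made up for illustration, not anything sacred.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two made-up samples: one bell-shaped, one strongly right-skewed.
normal_sample = rng.normal(loc=10.0, scale=2.0, size=200)
skewed_sample = rng.exponential(scale=2.0, size=200)

def looks_normal(sample, alpha=0.05):
    """Return (W statistic, p-value, verdict) from the Shapiro-Wilk test."""
    w_stat, p_value = stats.shapiro(sample)
    # A small p-value is evidence against normality. A large one only means
    # the test found no problem, not that the data is proven normal.
    return w_stat, p_value, p_value >= alpha

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    w_stat, p_value, verdict = looks_normal(sample)
    print(f"{name}: W={w_stat:.3f}, p={p_value:.4f}, looks normal: {verdict}")
```

Keep in mind that with very large samples the test tends to flag even tiny departures from normality, so it is worth looking at a histogram too.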
2. Grab a drink with a friend and try and figure out how to transform your data
So maybe you tried every possible test and every one of them said your data wasn’t normal. Even though that walk gave you all sorts of rejuvenating energy, you might need to really vent to a friend who can relate to non-normal data sets. If you were planning on using an algorithm that needs data normality to function, you’re really gonna need to vent. Just remember to bring a pen to the bar or coffee shop.
Draw some shapes on a napkin and complain about all of the skews that you just wish were not there. Take that bar napkin you hopefully did not dab your tears with, and start sketching out a scheme to transform your data into something Gaussian! There is nothing wrong with arbitrarily shifting, multiplying, and power transforming your data with constants, powers, and square roots. Shoot, you can even take the log of it, as long as you transform it back. In my experience, this is the most fun bar activity there is with a close friend. You’ll turn the worst night of your life into the best!
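If the napkin gets too soggy, the same scheme can be sketched in code. This is one possible version, assuming strictly positive data: a plain log transform, plus SciPy's `scipy.stats.boxcox`, which picks the power for you, and `scipy.special.inv_boxcox` to transform it back. The lognormal sample is made up for the example.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)

# Made-up right-skewed, strictly positive data (Box-Cox requires positives).
raw = rng.lognormal(mean=1.0, sigma=0.8, size=500)

# Option 1: a plain log transform, undone with exp.
logged = np.log(raw)
restored_from_log = np.exp(logged)

# Option 2: Box-Cox picks the power (lambda) by maximum likelihood.
transformed, lam = stats.boxcox(raw)
restored_from_boxcox = inv_boxcox(transformed, lam)

print(f"skewness raw:     {stats.skew(raw):.2f}")
print(f"skewness log:     {stats.skew(logged):.2f}")
print(f"skewness Box-Cox: {stats.skew(transformed):.2f} (lambda={lam:.2f})")
```

Whatever you estimate on the transformed scale (means, intervals, predictions) has to go through the inverse transform before anyone reads it as being in the original units.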
3. Take a road trip to the Gamma Distribution
You can’t always find a perfect way to transform your data. This is when you need to take a journey of self-discovery to somewhere you may have never been. That low-priced tourist destination sitting at the bottom third of your bucket list may be called the Gamma Distribution.
Traveling to new places gives you perspectives on life and inferential statistics you may never get from comfortably rewatching The Office or Muppet Treasure Island in your apartment. While spending time in your hotel room, or campsite on the road trip, you should try thinking from the perspective of the two-parameter Gamma distribution, which is incredibly flexible for highly skewed data. While it is the father of the exponential distribution, useful for wait times, Gamma is surprisingly flexible when modeling and analyzing skewed, positive data that refuses to be normal.
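Here is a rough itinerary for that trip: a minimal sketch of fitting a two-parameter Gamma with `scipy.stats.gamma.fit`. The waiting-time data and the true shape and scale values are invented for the example. Fixing `floc=0` is a choice made here so that only shape and scale are estimated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Made-up skewed, positive data, e.g. waiting times in minutes.
true_shape, true_scale = 2.0, 3.0
waits = rng.gamma(shape=true_shape, scale=true_scale, size=2000)

# Fit the two-parameter Gamma: floc=0 pins the location at zero so only
# shape and scale are estimated.
shape_hat, loc_hat, scale_hat = stats.gamma.fit(waits, floc=0)

# Use the fitted distribution directly instead of forcing normality.
fitted = stats.gamma(shape_hat, loc=loc_hat, scale=scale_hat)
low, high = fitted.ppf([0.025, 0.975])

print(f"shape={shape_hat:.2f}, scale={scale_hat:.2f}")
print(f"central 95% interval: {low:.2f} to {high:.2f}")
```

A fitted shape close to 1 would mean you have wandered back to the exponential distribution, which is just the Gamma with shape fixed at 1.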
4. Hike a mountain and find the true source of your data
There’s nothing quite like hiking up to a snow-capped mountain or a lake nestled between glacier-carved peaks. While you use the exercise to recharge your life and collect enough outdoor selfies to prove on your Tinder profile that you are in fact outdoorsy, think about where your data is coming from. Contemplate the mountain stream and how uniquely beautiful it is.
In a river at the bottom of a valley, that flowing water is the sum of hundreds if not thousands of tributaries. If those tributaries were additive independent random variables, their sum would be approximately normal, like the data you wish you had, according to the Central Limit Theorem. You are not some normal towny at that river. You are a unique snowflake melting into a mountain stream thousands of feet up, and so is your data.
That stream might have five tributaries tops, and there’s no way it would satisfy the key assumptions of the central limit theorem to qualify as normal data. Trace back the tributaries! Think on your hike about where your data comes from and model that process. By the time you hike back to your car, you will feel powerful, have a meditative understanding of your data, and your calves will be too sore to walk away from your computer for days.
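To see the stream-versus-river difference without leaving the trailhead, here is a small simulation sketch. It assumes each tributary is an independent exponential random variable, an assumption picked purely because it is obviously skewed, and the names `simulate_flow`, `mountain_stream`, and `valley_river` are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_flow(n_tributaries, n_days=2000):
    """Total flow per day as the sum of independent, skewed tributaries."""
    # Each tributary's daily flow is exponential (skewness 2): an assumption
    # chosen only to make the non-normality easy to see.
    flows = rng.exponential(scale=1.0, size=(n_days, n_tributaries))
    return flows.sum(axis=1)

mountain_stream = simulate_flow(n_tributaries=5)
valley_river = simulate_flow(n_tributaries=1000)

# In theory the sum of n such exponentials has skewness 2 / sqrt(n):
# about 0.89 for 5 tributaries and about 0.06 for 1000.
print(f"stream skewness: {stats.skew(mountain_stream):.2f}")
print(f"river skewness:  {stats.skew(valley_river):.2f}")
```

The river ends up nearly symmetric while the five-tributary stream stays visibly skewed, which is the whole argument for modeling the process that generates your data rather than assuming the bell curve.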
5. Try Yoga and the GLM
Maybe you’re not just trying to analyze one dimensional data and you’re dealing with multi-dimensional data. Just. Stop. What. You’re. Doing. and get into that downward dog. You’ll never be able to deal with the mental anxiety of dealing with data you may not be able to visualize unless you quit focussing on reps and start focussing on body control and breathing.
Maintaining a regular yoga routine will help you think multi-dimensionally. Relax through tough multi-dimensional sets of data and meditate on how to choose your predictor variables for your Generalized Linear Model (GLM). In my personal opinion, the GLM is probably way more underrated and important than yoga is.
In case you haven’t realized yet, linear regression assumes your residuals are normal. With a GLM, all you have to do is pick the distribution family and the link function. Gaussian, Exponential, Gamma, Inverse Gaussian, Poisson, Bernoulli, Binomial, Categorical, Multinomial: a GLM can fit so many useful error distributions. You just have so many options, you will feel as powerful as you will in that Warrior Two position.
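For the curious, here is a bare-bones sketch of what a GLM fit does under the hood, using only NumPy. It assumes one particular choice, a Poisson family with a log link, and fits it by iteratively reweighted least squares. The data, the coefficients, and the function name `fit_poisson_glm` are all made up for illustration. In practice a library such as statsmodels provides GLMs for you.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up count data: the mean grows exponentially with the predictor x.
n = 1000
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])   # intercept + one predictor
true_beta = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ true_beta))

def fit_poisson_glm(X, y, n_iter=25):
    """Poisson family, log link, fitted by iteratively reweighted least squares."""
    mu = y + 0.5                       # common starting guess that avoids log(0)
    eta = np.log(mu)                   # linear predictor on the link scale
    for _ in range(n_iter):
        z = eta + (y - mu) / mu        # working response
        XtW = X.T * mu                 # working weights equal mu for this family/link
        beta = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta
        mu = np.exp(eta)               # inverse link gives the fitted mean
    return beta

beta_hat = fit_poisson_glm(X, y)
print("estimated coefficients:", np.round(beta_hat, 2))
```

Swapping in a different family or link changes only the working response and the weights. The weighted least squares step stays the same, which is why one framework covers so many distributions.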
6. Learn how to cook something new by adding ingredients together
Even those who live on the whole food diet don’t just eat raw ingredients. You shouldn’t eat that way and you shouldn’t treat your data that way either. Who’s to say your data is just one parameterized distribution? While you start following that new recipe, don’t think, ‘I should follow this one and ignore all those other recipes I love.’ If you think you have that one spice you like to add to your Jambalaya and it won’t go well in your white bean chili, why not try it out, see how it tastes and average the spices together?
Just like how Tony’s goes good with everything, you can almost always mix a little Gamma or Gaussian into your model whether it actually belongs there or not. I know what you’re thinking, ‘you can’t just average models like food, you’ll end up with Cholula in a chicken noodle soup!’ The key is adding just the right amount, and you’d be surprised how far a tsp of Cholula will go in that Matzo Ball.
In statistics, you can do this with Mixture modeling! Maybe your data wasn’t passing that normality test because it was actually a few different bell curves averaged together. Sometimes data analysis is a chili and you gotta throw in a lot of stuff to get it just right. You may not get that chili the way your mom made it right the first time without finally calling her once in a while. You might have to get experimental. Just use a good metric to figure out how to average your Poisson vs Gamma or Cumin vs Chili powder and your mouth will be watering with a deliciously informative model.
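Here is one possible recipe, sketched in plain NumPy: expectation-maximization for a mixture of two bell curves. The data, the 60/40 split, the quartile-based starting guess, and the function names are all assumptions made up for the example. For real cooking, scikit-learn's `GaussianMixture` does this with far more care.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data: two bell curves mixed 60/40.
data = np.concatenate([
    rng.normal(0.0, 1.0, size=600),
    rng.normal(6.0, 1.5, size=400),
])

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def fit_two_gaussians(x, n_iter=200):
    """Expectation-maximization for a two-component 1-D Gaussian mixture."""
    # Crude starting guess: put the two components at the quartiles.
    means = np.percentile(x, [25, 75])
    stds = np.array([x.std(), x.std()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component.
        dens = weights * normal_pdf(x[:, None], means, stds)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and spreads from those probabilities.
        totals = resp.sum(axis=0)
        weights = totals / len(x)
        means = (resp * x[:, None]).sum(axis=0) / totals
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / totals)
    return weights, means, stds

weights, means, stds = fit_two_gaussians(data)
print("weights:", np.round(weights, 2))
print("means:  ", np.round(means, 2))
print("stds:   ", np.round(stds, 2))
```

The fitted weights are the "right amount" of each ingredient. Deciding how many components to use is a separate question, usually settled by comparing fits with a criterion such as AIC or BIC.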
7. Seek Professional Help
If you’re still reading, and nothing has worked, and you’re about to grow a soul patch because you feel so inadequate, don’t do anything rash just yet. There is nothing wrong with asking for help. Many of the proposed suggestions can sometimes take a lot of creativity and it is always good to get a second set of eyes on your problem. Likewise, if you have a statistical mentor, they may be able to relate in a way that a high priced therapist cannot. “Oh that reminds me of a data set I had to deal with a long time ago.”
Nobody knows everything, even if you are at the top of your field. You will be surprised what methods you may have overlooked. Even if your statistical therapist doesn’t know how to solve your problems, just talking it out with someone may give you the answer you need.
8. Accept yourself, the risk, and Blame the data
It might be an outlier, it might be model error, but at some point, done is better than perfect. It’s okay to blame data, as long as you assess the risk when assuming normality. If that risk is acceptable, move on and be that person your mom brags about to her friends. The important thing you have to do is accept yourself as a data scientist, or an engineer, or whatever unfortunate internship you signed up for this summer, and say, “This model may not explain everything but it’s the best I can build, and that’s okay.”
That doesn’t mean that you are bad at stats, or coding. It just means you got really weird data that is beyond understanding like the Scottish Language. The best thing you can do is quantify the risk in your model and call it a day, or semester, or thesis, or government funded research grant. Even if you have to use a Gaussian model on your non-normal data because nothing fit that well and it was only a little skewed, being honest about the risk in your model is the most professional thing you can do. Just remember that all models are wrong but some are useful.
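One hedged way to put a number on that risk is to check how badly the Gaussian model misjudges the tail. The sketch below assumes mildly skewed, made-up Gamma data and compares how often the Gaussian says a threshold should be exceeded with how often the data actually exceeds it. It is an illustration of the idea, not the only way to quantify model risk.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Made-up, mildly right-skewed data standing in for the weird data set.
data = rng.gamma(shape=8.0, scale=1.0, size=5000)

# The convenient Gaussian model: match the sample mean and standard deviation.
mean, std = data.mean(), data.std(ddof=1)

# Where does the Gaussian say only 1% of values should land above?
claimed_rate = 0.01
threshold = stats.norm.ppf(1 - claimed_rate, loc=mean, scale=std)

# How often does the data actually land above that threshold?
observed_rate = np.mean(data > threshold)

print(f"threshold: {threshold:.2f}")
print(f"Gaussian model claims {claimed_rate:.1%}, data shows {observed_rate:.1%}")
```

If the gap between the claimed and observed rates is small enough for your application, write it down in the report and move on. If not, at least you know exactly how wrong the useful model is.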
I hope this helped you in your desperate time of need, and if you care about verifying your own assumptions and maintaining mental health as much as I do, please like, share, and subscribe with your email, our Twitter handle (@JABDE6), our Facebook group here, or the Journal of Immaterial Science Subreddit for weekly content.