The Data Scientist's Guide to a Career in AI
Data science is in high demand right now, and it is no secret that many industries are benefiting from it. This article aims to show why it matters by describing the tasks that a typical data science project involves.
Data science endeavors should have the goal of empowering the data user (the client with whom we are dealing) so that they may be able to make proper data-driven decisions that have a positive impact on their business. Such positive impact is primarily in the form of greater profits, lower expenses and/or enhanced operational efficiency.
Such endeavors typically start by identifying the data user's business pains (aspects of the business that have significant room for improvement) that data science can help alleviate. Data science can uncover hidden trends in the data that the business may capitalize on, find and fix problems in the data, build prediction models from historical data to anticipate what will happen next, and so on.
What follows is an outline of a data science project, to give you a glimpse into what a career in data science entails.
Let’s dive into what data scientists do in their day-to-day…
The A to Z of a Data Science Project
A data science project typically involves the following steps: data collection, data wrangling, data analytics, and presenting findings. Keep reading to get a sense of what each of these steps looks like.
1. Data Collection
It goes without saying that without data there can be no data science, so we need to find us some data first.
Bear in mind that the collected data would have to enable us to achieve the goal of data science that we mentioned earlier; that is, to help the data user make impactful decisions that would benefit their organization. This goal should be the north star that guides us as to what data we should focus on collecting.
If a suitable dataset is already available (either online or from an existing business’s database), you may go ahead and obtain it. Alternatively, you may scrape the data you need from the internet or even conduct a survey to collect it.
However, in the less likely case that a data scientist is approached to provide consulting for a business that is just starting and creating its data infrastructure, this data scientist needs to have foresight on how the data will be used to help the data user (Stanton, 2012).
This is because the data user will ultimately make use of future analyses on the data to make decisions that will affect the business, and it is the data scientist's job in this case to make sure that the data user will eventually be able to do that! In other words, for these critical analyses to be possible in the future, the data scientist must ensure that the necessary data elements are currently accounted for in the architecture of the database that will be storing all the data.
Now that the collection process is complete, it is time to wrangle the data…
2. Data Wrangling
This is the phase during which you will ‘clean’ the data and transform it into more accessible/desired formats for analysis. It is also often the most time-consuming part of a data science project, as most real-life datasets are not ready for analysis.
Typical tasks a data scientist carries out on a dataset include:
Handling missing values in the data
Correcting corrupt values in the data (i.e. incorrectly stored values)
Joining data that is scattered over multiple tables
Engineering new features that are important for later analysis
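To make the wrangling tasks above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the customers and orders tables, their field names, the rule that a negative amount counts as corrupt, and the choice of the median as the fill value. A real project would pick these rules based on the business context.

```python
from statistics import median

# Hypothetical raw tables; all names and values are made up for illustration.
customers = {
    1: {"name": "Acme", "region": "North"},
    2: {"name": "Birch", "region": "South"},
}
orders = [
    {"customer_id": 1, "amount": 120.0, "items": 4},
    {"customer_id": 1, "amount": None, "items": 2},   # missing value
    {"customer_id": 2, "amount": -50.0, "items": 1},  # corrupt value (negative amount)
    {"customer_id": 2, "amount": 80.0, "items": 2},
]


def wrangle(orders, customers):
    """Clean the orders, join them with customers, and add one new feature."""
    # Corrupt values: treat negative amounts as invalid and mark them as missing.
    cleaned = []
    for order in orders:
        amount = order["amount"]
        if amount is not None and amount < 0:
            amount = None
        cleaned.append({**order, "amount": amount})

    # Missing values: fill them with the median of the valid amounts.
    fill_value = median(o["amount"] for o in cleaned if o["amount"] is not None)

    rows = []
    for order in cleaned:
        amount = order["amount"] if order["amount"] is not None else fill_value
        customer = customers[order["customer_id"]]  # join on customer_id
        rows.append({
            **order,
            "amount": amount,
            "region": customer["region"],
            # Engineered feature: average amount spent per item in the order.
            "amount_per_item": amount / order["items"],
        })
    return rows


rows = wrangle(orders, customers)
for row in rows:
    print(row)
```

Filling with the median is only one option; dropping incomplete rows or asking the data owner for the correct values may be more appropriate depending on how the data will be used.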
Now that the data has been wrangled, it's time to analyze it to bits…
3. Data Analytics
There are different kinds of analytics here: descriptive, diagnostic, predictive, and prescriptive analytics. The key to all of them is curiosity. Try to have a bunch of questions in mind that you want to answer before you proceed and analyze the data. If you’re drawing a blank and can’t come up with questions, start with summary statistics of the dataset you’re working with and be curious about what you observe to decide what to do next.
Also known as exploratory data analysis, descriptive analytics entails exploring the data for any trends that lie therein, aggregating data from multiple viewpoints and plotting multiple visuals that enable us to further understand the data (think histograms, scatterplots and the like).
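As a rough illustration of the summary statistics and aggregation described above, the sketch below uses only Python's standard library. The sales records and the region grouping are hypothetical, and the chosen statistics are just a starting point rather than a fixed recipe.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (region, sale amount) records, made up for illustration.
sales = [
    ("North", 120.0),
    ("North", 100.0),
    ("South", 80.0),
    ("South", 60.0),
    ("South", 100.0),
]


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def summarize_by_group(records):
    """Aggregate (key, value) records and summarize each group separately."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: summarize(values) for key, values in groups.items()}


overall = summarize([amount for _, amount in sales])
by_region = summarize_by_group(sales)

print("Overall:", overall)
for region, stats in by_region.items():
    print(region, stats)
```

Looking at the same numbers from more than one viewpoint (overall versus per region here) is often what prompts the next question to ask of the data.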
The scope of the data science project can act as a compass that assists the data scientist in choosing what data elements to focus on. This is especially true if the data is wide; that is, the data consists of too many components to exhaustively analyze (think of a database that has hundreds of tables containing different kinds of information).
Diagnostic analytics is similar to the above, but here a number of techniques are deployed to observe an unexpected pattern and understand the reason behind its appearance.
Predictive analytics is more forward-looking: with the help of a variety of statistical techniques, it identifies patterns (problems or outcomes) that are likely to occur in the future.
Prescriptive analytics takes things one step further: assuming we know what ails the business, we attempt to provide the data user with suggested actions that, if applied, would result in a direct or eventual improvement in the key performance indicators (KPIs) of interest.
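One of the simplest statistical techniques for looking ahead is fitting a straight line to historical data and extending it. The sketch below does this with ordinary least squares in plain Python. The monthly revenue figures are made up, and a linear trend is an assumption chosen purely for illustration; real predictive work usually involves richer models and careful validation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: revenue observed over five consecutive months.
months = [1, 2, 3, 4, 5]
revenue = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, revenue)

# Extend the fitted line one step ahead to estimate month 6.
forecast = slope * 6 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, month-6 forecast={forecast:.2f}")
```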
Analyses done? Time to show the fruits of our labor…
4. Presenting Findings
It's important to enter this stage while bearing the data user in mind. It is essential, for example, that the data user understands what you’re showing them. If they do not, they will be unable to make a sound business decision.
In this regard, possessing a good understanding of the business domain that you are working in is crucial. It also helps to use your client’s own terminology so that you can convey your message and findings to them clearly.
Finally, never underestimate the power of data visualizations and their instrumentality in getting your message across in a way that is easily digestible and accessible.
This guide was brought to you by Dr. Ali Ezzat, Chief Data Scientist at Synapse Analytics. Dr. Ali currently leads our data science team and oversees data science projects within the organization. He received an MSc degree in Bioinformatics (2013) and a PhD in Computer Science (2018) from Nanyang Technological University, Singapore.