Personal reflections on data science jobs
A bit more than a year ago, I took the plunge and left my academic job to try my luck as a corporate data scientist, first at IBM (obviously a very big company) and now at Peltarion (a startup which I still want to call small although it is growing rapidly). I am not sure if this blog post is premature or not, but in any case I’d like to share some of my experiences and impressions of the different roles I’ve been in. So without further ado, I present my last three data science positions!
(1) Bioinformatics scientist at Stockholm University (May 2010 - May 2017) + freelance gigs.
I was working as a senior bioinformatician at SciLifeLab/Stockholm University in different capacities for seven years. At first I was hired as a general bioinformatics go-to person in a so-called core facility that does DNA sequencing, where I would be involved in a lot of different kinds of things: setting up data pipelines, deciding on quality control routines, trying to figure out what had gone wrong, delivering data to and communicating with customers, performing routine or custom analysis, and sometimes doing some actual research and writing papers. After a while, I moved into a different role where my job was more explicitly to help researchers with data analysis, statistics and programming – more research-oriented and long-term work. In a way, I was an academic data science consultant. Of course, we didn’t really call it “data science” because we were doing science, plain and simple, but in terms of what we did all day, it was in many ways similar to “data science” in industry.
Characteristics of data science in an academic (biology) setting
Note: this is the type of role I have the most experience with, or the most data on, if you will, so I am more confident about the pronouncements here than in the other categories.
- The final product is almost always a paper. This has some positive and negative implications. On the good side, there is (at least nowadays) a strong focus on reproducibility. On the bad side, there is almost no emphasis on putting predictive models into production or making them easily usable. Code quality can also be spotty as a result.
- Bioinformatics data scientists tend to be good at data visualization, often in R or Python. They understand the concept of batch effects (drift in distribution parameters) and are good at dealing with high-dimensional data where the number of examples is usually much smaller than the number of dimensions (n << p), for example datasets with measurements of 20,000 genes for 20 different individuals. This makes it necessary for bioinformatics data scientists to be familiar with dimensionality reduction and multivariate methods such as PCA, PLS, t-SNE and so on.
- They like to use notebooks (Jupyter or R Markdown) to communicate analyses, because these have a similar structure to scientific presentations or manuscripts.
- They often like to use pipelining tools such as Nextflow, Snakemake or Bpipe to chain operations together.
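To make the n << p point in the list above concrete, here is a minimal sketch of PCA on a toy dataset with far fewer samples than features. Everything in it is an assumption made for illustration: the simulated data, the function names `simulate` and `first_pc_scores`, and the choice to work with the small n x n Gram matrix and power iteration so that the example needs nothing beyond the Python standard library. In real work one would more likely reach for an existing implementation such as scikit-learn's `PCA`.

```python
import math
import random


def simulate(n_per_group=10, p=500, n_shifted=50, shift=3.0, seed=1):
    """Toy 'expression matrix': two groups of samples, p features (n << p).

    The first n_shifted features are shifted in the second group.
    All numbers are invented for illustration.
    """
    rng = random.Random(seed)
    X = []
    for group in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(0.0, 1.0) for _ in range(p)]
            if group == 1:
                for j in range(n_shifted):
                    row[j] += shift
            X.append(row)
    return X


def first_pc_scores(X, iters=200, seed=0):
    """Return each sample's score on the first principal component."""
    n, p = len(X), len(X[0])
    # Center every feature (column) at zero.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # n x n Gram matrix of the samples: cheap because n is small,
    # whereas the p x p covariance matrix would be huge.
    G = [[sum(a * b for a, b in zip(r1, r2)) for r2 in Xc] for r1 in Xc]
    # Power iteration for the leading eigenvector of G.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, v)) for row in G]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue (v has unit length).
    eigval = sum(vi * sum(g * x for g, x in zip(row, v)) for vi, row in zip(v, G))
    # With Xc = U S V^T, the PC1 scores are the first column of U S.
    return [math.sqrt(eigval) * vi for vi in v]


X = simulate()
scores = first_pc_scores(X)
print("group A:", [round(s, 1) for s in scores[:10]])
print("group B:", [round(s, 1) for s in scores[10:]])
```

In this toy setup the two groups should end up on opposite sides of the first component, which is the kind of global view of the data that makes such methods useful when there are thousands of features per sample.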
During this time, I was also consulting part-time (usually less than 10%) for a few startups. From one of these gigs I learned to build very complex processing pipelines with Snakemake. From another, I learned to build obscure functionality for web applications in Shiny. These are both tools that fit naturally into a bioinformatician’s mindset. For yet another customer, I suggested a way to use PCA and MDS to view their data from a global point of view which they had not considered, guiding them onto a path that eventually resulted in this Medium article.
(2) Senior Data Scientist at IBM (June-November 2017).
After having been an academic consultant for quite a while, I decided to try to be a corporate one for a change. I got a position at IBM’s consulting arm, Global Business Services, in Kista outside of Stockholm. Since I was only there for six months, I only had time to participate in a handful of projects, which were mostly related to the manufacturing industry. Fortunately, the knowledge of high-dimensional data that I had from biology stood me in good stead when working on these problems. It was not difficult to apply the skills I had obtained from academia in this setting.
Characteristics of data science in a “big consulting” setting
Note: Given my short experience, I find it hard to tell for some of the points below whether they hold for consulting companies in general or are specific to one company (IBM in this case).
- Pragmatism is the word to summarize data science in “big consulting”. There isn’t time to think through every wrinkle of a problem as there is (albeit in theory only) in academia. The end goal is specified by a contract which you try to fulfill as closely as possible within the allotted amount of time. Notably, your task is not to do as much as possible but to do exactly what has been agreed upon. There is almost always a trade-off between time and model performance.
- Consultants are good at giving effective presentations. One of the first things I learned was to completely rework the way I had done presentations in academia to more clearly highlight the important findings and tailor them for the management level in companies. Communication is a very important skill for a data science consultant; maybe the most important one.
- As in academia, there is not much emphasis on productization either, because that part will typically be handled by a software engineering team that comes in after you have completed a proof of concept (PoC), if that PoC leads to a longer engagement. On the other hand, the IBM stack (see below) has good support for deploying models, e.g. via Node-RED.
- (This part might be more or less company-specific) In my team, we did not make very much use of code version control with GitHub, for a couple of different reasons. Since we worked mostly with short PoC projects, the priority was to find a promising approach in the allotted time, after which the software engineering team would come in to build the final implementation if the contract was extended. Also, some of my colleagues worked mainly with non-code tools such as SPSS Modeler, which has its own built-in version control. We ensured reproducibility mainly through the version control mechanism in Box, where we stored scripts, documentation and metadata.
- Automated data cleaning and model building (AutoML) are important in this setting because of the time constraints. Data cleaning can yield big “quick wins” but is tedious, and a lot can be gained by automating it, for example with packages such as vtreat for R. AutoML with TPOT, auto-sklearn or H2O is interesting for rapidly finding a good-enough model.
- Feature importance or other types of model explanation are very important for communicating results to customers (also see below). Decision trees are still used surprisingly often, and for random forests and gradient boosting, there is feature importance and various tree-model explanation interfaces.
- It’s quite common to encounter projects with unbalanced data and to use tools like SMOTE, ADASYN or ROSE to do smart oversampling of the rare class(es). It is also not uncommon that some classes are so rare that one needs to go for an anomaly detection approach rather than standard classification.
- (Company specific, at least in part) In terms of tooling, there was a larger emphasis on using commercial products (preferably from the IBM ecosystem) such as SPSS Modeler rather than open-source programming languages. Naturally, one has to rapidly become conversant with Bluemix (now called IBM Cloud) offerings and associated products in order to be an effective IBM consultant.
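The "smart oversampling" of rare classes mentioned in the list above can be sketched roughly as follows. This is a simplified, standard-library-only illustration of the interpolation idea behind SMOTE, not the implementation found in any particular package; the function name `smote_like`, its parameters and the example points are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority-class points.

    Each new point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    Needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [pt for pt in minority if pt is not base]
        neighbours = sorted(others, key=lambda pt: math.dist(base, pt))[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> mate
        synthetic.append([b + t * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Invented example: five rare-class points in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]]
synthetic = smote_like(minority, n_new=15)
print(len(minority) + len(synthetic), "minority-class points after oversampling")
```

Because the synthetic points are interpolated rather than copied, the classifier sees a slightly filled-in region for the rare class instead of exact duplicates. When a class is so rare that there are hardly any points to interpolate between, this approach breaks down, which is where anomaly detection becomes the more natural choice.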
(3) Data scientist at Peltarion (Nov 2017-).
In the autumn of 2017, I got an offer from a deep learning company, Peltarion, that I had applied to before starting at IBM. I decided to take it on the strength of the skills of my new colleagues, many of whom I knew from the Stockholm AI and machine learning scene. As the company is a startup, I have worn many hats during the first six months, working on customer projects, writing documentation and blog posts, testing our deep learning platform, sitting together with beta testers, keeping an eye on competitors and so on.
Characteristics of data science in a startup setting: (surely not representative of all startups…)
Note: I suspect that the variance among startups is much higher than among academic groups or big consulting companies, so almost everything here is probably highly company-specific.
- (Possibly company-specific) There is more emphasis on software engineering practices than in academia or big consulting. Git and GitHub (or some equivalent) are not “nice-to-haves” but the core of the whole enterprise, and frequent pull requests and code reviews are much more common. Virtual environments and containers (e.g. Docker) are important (though also found in academic bioinformatics to a large extent).
- Data scientists in startups tend to think more about deployment and productization of models, because it hits closer to home (there often isn’t a supporting software engineering team to do that for the data scientists, or the startup is building its own deployment functionality, like we are at Peltarion).
- (Possibly company-specific) Startup data scientists tend to be more informed about the latest technical advances in machine learning. Consultants don’t have time to keep up as much (or to install and play with the latest tools) and academics are often more interested in keeping up with the latest scientific advances in their specific field rather than general ML news. It is also more important to keep track of competitors.
- (Possibly company-specific, e.g. Spotify uses Luigi) Reproducibility is achieved by writing libraries rather than chaining together operations with pipelines. Continuous integration (CI), like with Travis or Jenkins, is much more common than in academia, although it is starting to appear there as well. For us at Peltarion, CI is essential because we need to move fast and make every effort to minimize technical debt that could come back and bite us in the future.
I hope you enjoyed this highly subjective look at different kinds of data scientist positions. Feel free to ask questions in the comments section or provide your own views on different roles.