Episode 1: Sole Galli

Wed, 19 Aug 2020 21:07:52 +0200

Award-Winning Data Scientist Sole Galli talks about her journey from research into Data Science, discusses the challenges of feature engineering, and gives advice for both new and more experienced practitioners. 

Show Notes

Sole's Website: https://www.trainindata.com/
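
Worked example: in the episode (around 13:00), Sole argues that it is not enough to know that imputation with the mean exists; you should also understand what it does to the variable. The sketch below is not from the episode and does not use Feature-engine. It is a minimal illustration in plain Python on made-up data (the `ages` list and the `impute_with_mean` helper are hypothetical names chosen for this example), showing that filling missing values with the mean leaves the mean unchanged but shrinks the variance.

```python
import statistics

# Hypothetical toy data: ages with missing entries marked as None.
ages = [22.0, None, 35.0, 41.0, None, 58.0, 29.0, None, 47.0, 33.0]


def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed_values = [v for v in values if v is not None]
    fill = statistics.fmean(observed_values)
    return [fill if v is None else v for v in values]


observed = [v for v in ages if v is not None]
imputed = impute_with_mean(ages)

# Compare the variable before and after imputation.
mean_before = statistics.fmean(observed)
mean_after = statistics.fmean(imputed)
var_before = statistics.pvariance(observed)
var_after = statistics.pvariance(imputed)

print(f"mean:     {mean_before:.2f} -> {mean_after:.2f}")
print(f"variance: {var_before:.2f} -> {var_after:.2f}")
```

Because every filled value sits exactly on the mean, the imputed variable looks less spread out than the observed data. That is the kind of side effect worth knowing about before choosing an imputation technique for a given model.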

Transcript

[00:00:00] Hello, and welcome to the CourseMaker podcast. Today I'm delighted to have Sole Galli here as our guest for our first episode. Sole is one of LinkedIn's Top Voices 2019 in data science and analytics, and she and I have worked on courses together before. Sole has made some significant contributions to open source, which we can get into in our conversation today, and I'm just really excited to be able to have this conversation. Hi Chris, thanks for the introduction and for the invitation. I'm looking forward to our chat. Me too. So I guess what we're doing here is just trying to learn a bit more about you and give your students a bit of background [00:01:00] and hopefully some inspiration and actions that they can take on their own learning journeys.

So to begin with, could you just tell us a bit about your background? Yes, sure. I actually studied biology in university, and after that I did a lot of years of hardcore scientific research. I did a PhD in Argentina where I studied redox metabolism and its relationship to separation. Then I moved to Germany, where I pursued postdoctoral studies on microscopy, following fluorescent proteins within the cell. Then I moved on to UCL, where I continued research, again in molecular biology, cellular trafficking and so on. They were very exciting years, but at some point I decided to move on and I jumped into data science. Okay. So you had a significant

[00:02:00] research career before you moved into data science. Yes, I did. I had the luck of doing research in different countries and on different topics. I got to learn a lot from very exciting people, including the statistical skills that were actually paramount for the jump into data science. Yeah, for sure. And so when you decided to make the switch, what made you choose data science?

Yes. Well, I guess during my academic research, what I enjoyed the most was microscopy and analyzing images, and what you have to do for that is actually a little bit of programming. We used a tool called MATLAB, because images in essence are matrices of colors and points. So I jumped a little bit into programming and I saw that it was a lot of

[00:03:00] fun. And also the part of analyzing data that I liked the most was actually the statistical experiments and so on. And so for me, it was mostly like one plus one makes two. I mean, I like stats, I like programming; data science seemed like the obvious choice. So that's how I stepped in here. Yeah. Okay. And how did you go about making the transition? What did you study? What did you need to prepare in order to make that career transition? Yeah, for me, I actually had to learn quite a bit of stuff, because even though I knew a lot of stats, it's not necessarily what you use in data science and machine learning.

And also, even though I had programmed a little bit in MATLAB, my programming level was very, very low, and MATLAB is not a language that is very popular in the data science community. So I had to learn pretty

[00:04:00] much everything. I had to learn first to program in R, and that was my first language of choice. Then I learned how to program in Python. I had to learn a little bit of SQL, and SQL I learned mostly on the job, but R I learned before I left academia and actually got my first job. And I also had to learn a lot about machine learning and machine learning algorithms. So, yeah, I was going to work during the days, and then I studied a lot in the evenings to get up to speed with what I needed. And to learn, I have to say I used these wonderful resources that we can find online nowadays, these massive online courses that are great and make knowledge available to pretty much everybody, in some cases for free, in some cases for a very low fee. So knowledge is out there for those who actually want to get it. I think it's wonderful. That's how I studied at the beginning. And then

[00:05:00] one never loses one's old habits, so I jumped on to scientific papers, those that basically discovered or, if you want, designed decision trees, random forests and some others, and then some books too. Yeah. I mean, it sounds like quite a bit of work making your transition. How long would you say you were doing your evening study for? From the moment I decided that I wanted to change until I actually got my first job, I think it was only six months, so it's not that much, to be honest. But then of course I continued studying later on as I was doing my work. I mean, a lot of things I learned on the job, a lot of experience that I got from there, and then also a lot of reading, and sometimes I attended meetups and talked to colleagues. So

[00:06:00] the learning didn't stop when I got the job; I think it kind of grew exponentially from there. Yeah, I know what you mean. Okay. And then let's talk a bit about your courses. So, you have courses on feature engineering and feature selection, as well as courses that we've worked on together around machine learning model deployment and machine learning testing. What made you decide to do a course on feature engineering? Yes. Actually, the main reason was that there was no course on feature engineering, or at least at the time there wasn't any, and I found it very, very hard to find out how you can preprocess your data in order to use it as an input for machine learning models. What I noticed, because I was taking so many courses, is that they are very, very good at helping people take those first steps. Then in your first

[00:07:00] job, if you want, there is a bit of a gap with what comes next. I mean, once you're there, you don't really know what best practices are, how things are done within an organization, or how things in an organization differ from data science competitions. The knowledge, if you want, can be out there, in blogs, in some scientific articles, or when some organization publishes a white paper, but it's very, very hard and very time consuming to find it. And because I did that in order to engineer my variables for the first model that I made, I thought that it would be a great idea to put it all together into one course and make it accessible for everybody, or at least everybody who wants to learn more about how to transform their variables. So that was what got me started with feature engineering, and the same was true for feature selection. Yeah. So it sounds like when you were

[00:08:00] first figuring out feature engineering, you were looking, you said, in blogs and some scientific papers. How long did it take you to get your head around it and figure out what you needed to for feature engineering? I had quite a bit of knowledge from doing it while I was working. I mean, I did all the reading and I kind of knew what was going on by the end of it. So then I decided I would put it together in a course, and to make the first course and shape it up a little bit, I would say it was another four months. It was my first course, and I think it was good content, but potentially not the best delivery. I didn't have a lot of experience on how to put an engaging course together, but fortunately the students gave me a lot of feedback. And then last year I actually made the second

[00:09:00] version of the course, which I think is much better. And yes, things evolve very quickly in technology, so it's not that you make one course and it stays there forever. At least I like to update it and keep abreast of what's going on in the world, so I included the new libraries that are coming up. And as I was developing the course, I noticed that it is a little bit hard to engineer your variables in a way that leaves you almost ready to deploy a model in production. I felt that there's a little bit of a technological gap there. That's why, as a tool to accompany the course, I created this open source package, Feature-engine, which kind of addresses this problem. It allows the user to transform the variables in a way that they can either continue doing data analysis, because this package returns pandas DataFrames, and therefore you can continue exploring your data as you want. But if you want, you can also

[00:10:00] use it within a pipeline that is ready to deploy. So yeah, that's how this package was born. And I continue basically learning more and more about feature engineering, and whatever I find interesting and exciting, I just add to the course. Yeah. I mean, I can definitely second that Feature-engine is a great package, and for sure we'll link to that in the show notes for anybody listening who's not encountered it before. Interesting news, actually, about Feature-engine: it's now going to be included in the stack. Is that right? Yes, it is going to be included in scikit-learn's related projects page. So, great: Feature-engine now features among other projects that are related to scikit-learn in some way, particularly in functionality. And now it's also part of the Anaconda distribution, so you can install it that way as well. And in the last couple of

[00:11:00] weeks there have been a lot of contributors actually helping me enhance Feature-engine's functionality. So I'm very excited about that, and I'm probably going to share something online about how much I appreciate the work of these people. That's really great news. Congratulations on that. Awesome. Okay. Well, if we go back to your courses, how would you say that your teaching approach differs from similar, or just other, data science courses? Yes, that's an interesting question. I don't know if my teaching approach differs. I think it's mostly about the content. Like I said, I try to make courses that fill gaps in knowledge. So my courses, and I think our courses, because deployment and testing are also not for beginners, just try

[00:12:00] to basically bridge that gap. So we've gathered a lot of experience on the job, on how things are done in different organizations, and through talking to colleagues. And I try to put all that into the courses so that the students or the users get a good feeling of how things are done within an organization that is looking to use those models live, ideally, and what problems we have encountered and how we can go around them. So there's a lot of practical insight in them. Plus the fact that the content of the courses is not necessarily for beginners, but is actually looking to help people who have already taken the first steps. And then what I think is kind of unique as well is the fact that we have created open source to support the course. But now it's taken off, so now it goes beyond the course. But yeah, I think those are the most interesting bits of our courses.

[00:13:00] For sure. And what do you think are one or some of the key things that students need to know to improve their understanding and capabilities with feature engineering? To me, what is critical is to try and understand what the transformations are doing to the variables, why we do those transformations, and how they are connected to, or how they will affect, the performance of some models, for example. And I think, this is really my opinion, but it's not only important to know that, for example, imputation with the mean exists, but also what its effect on the variable is and how that is going to affect the performance of the model. It's kind of wondering, thinking: should I

[00:14:00] use this technique or that technique, if this is the model that I'm going to build? So yeah, I would say critical thinking and knowledge-driven decisions on how to preprocess the data are going to take us a long way. Yep. And on the flip side of that, what are some of the most common mistakes that people are making when it comes to feature engineering? I think, as you said, it's the flip side. I don't know if these are mistakes. I don't like talking about mistakes and doing things well or wrong. I think that if we don't think about why we're doing what we're doing, we end up producing models and transformations that will not return the best performance. And then we can

[00:15:00] leave some benefits on the table or, conversely, because we need to improve the performance, we are going to spend a little bit more time on trial and error, which we could have avoided if we had given some critical thought to how we want to progress things and how we want to move forward. Yep. Makes sense. Okay. I'm going to ask you a little bit about more general career stuff, for any students who are thinking about how they can grow their careers in data science. So for people who want to get into data science, what are some of the mistakes people make, and what's some advice you'd give to people who may be just starting out and want to get into the field? Yes. So, interesting question. Data science is very, very trendy these days, and there's a lot of demand, certainly

[00:16:00] for people who want to do data analysis and machine learning. I think a good starting point would be to think about whether I am going to enjoy doing that: getting familiar with what data science is, what my daily tasks are going to be, and do I really want that? Do I really like it? And if the answer is yes, then maybe this is the first step to move forward. But if we're not going to like that, then potentially that's not the job for us. And yeah, my advice would be to know programming, kind of Python and R, and a little bit of SQL. You need to get to know a lot of machine learning algorithms and stats. We don't have to learn it all at once. This is an ongoing process,

[00:17:00] but whenever we learn something, I would highly recommend getting to know it well, because the better we know something, the more resourceful we are when the problems come. So we can know at a very high level what a random forest does, and we can use it, because with scikit-learn today it's import RandomForestClassifier, boom, you have an algorithm. But if we don't really know what a random forest is, what it does to the features, how it selects them, how it makes the decisions, then it's very hard for us to understand which feature is important, how many features I have to use, or what happens if I use a lot of features or too few features. So, yeah, I think it's about knowing it well, so that we are more resourceful to resolve the problems that

[00:18:00] may come, and I think the problems will come. And that's why I said at the beginning: make sure that you really like it, because it will help you with that, I think. Yeah, definitely. Good advice. And then, I guess what we're talking about there are people earlier in their career. What about data scientists who maybe have been doing the work for a little bit longer? They're somewhat along in their career and looking at somebody like yourself and thinking, how do I replicate some of your success? What advice would you have for those? Yes. After you've started, there are a lot of possible directions that we can follow. Data science is about analyzing data and producing machine learning models, but then we need to deploy the models. So a little bit of software engineering skills are very useful.

[00:19:00] I would say I'm very high minded, and I personally find it quite exciting. So that's an aspect that I like a lot, even though I, for example, am nowhere near being an expert like you are, Chris. But there is also another possible road, like data engineering, which is basically how we get our data to the data scientists' stack, if you want, so that they can actually preprocess it. So I think the time comes when we have to make a decision about what we like the most, and then we specialize in that space, be it software engineering skills, be it data engineering skills, or be it more modeling skills. Maybe we want to become technical leaders, so then we need to learn a lot of people skills and managing skills, I think. Yeah. So those are paths that we can follow, and yeah, it's always continue learning and learning. I hope that answers the

[00:20:00] question. Yeah, for sure. It's interesting what you say about the people skills and managing. I wonder, what do you think makes a successful data science team? Because sometimes, for me, looking at data scientists I've worked with, data science can seem like a little bit of a lone wolf sort of activity. What are your thoughts about what makes for a successful data science team? Yeah, that's a good one. I'm not too sure that there's one single answer, but I think it's important that the team is actually a team with several people and not just one. I think critical mass is important, and if we have different people in the team with different expertise, that enriches the team, because one may know a lot about supervised learning, the other one may know a lot about Bayes. And then that basically makes the team very rich

[00:21:00] in terms of knowledge. Then it's also extremely helpful, if not unavoidable, that we have one person, or ideally more, who is able to translate the technical knowledge for a lay audience, because ultimately the data scientists are not the ones that are going to use the models. Other people are going to use them, and it's critical that we understand how other people are going to use our models, what they want them for, and what they want to achieve. And then we work together to produce the best solution for that. It's important that we understand their job, what they need, and also that they understand how we can help them. So I think that's sort of important. And then communication with the software developers as well, because they also have their requirements, and we can produce, hopefully, something that is already very

[00:22:00] close to their requirements. So then we can also reduce the load from research to production. So I think, yeah, it's a little bit of everything, like variety in people and skills. I think that's one of the most important things. Absolutely. Yeah. Okay. Well, look, I mean, this has been super interesting, Sole. I guess just to close things off, what's next for you? Or what are you working on now? I'm developing new courses at the moment. I'm working with two colleagues that I have managed to get excited about creating new courses. So yes, that's what we're doing. There are two more courses in the pipeline. One is probably going to come out before the end of this year, the other one probably in the first quarter of the next. Yeah. And then I'm also constantly updating Feature-engine and updating the courses. Now it's time to update feature

[00:23:00] selection. And I'm also planning to write a book on feature selection, and hopefully it's also going to come out before the end of this year. You're always busy. That's really exciting stuff. Sounds like lots of interesting things coming along soon. Where can listeners go to find out more about you and your work? We have a website, www.trainindata.com. That is the best way to find out about our courses, our books, a little bit about ourselves, and for sure all the links to all our relevant resources, or the repo that hosts Feature-engine on GitHub. And we can also be found on LinkedIn. Great stuff. Well, I'll make sure that all those links and everything will be in the show notes. So with that then, Sole, thank you ever so much for sharing your story. I'm sure that our listeners will really enjoy hearing some of

[00:24:00] those details about the journey you've been on. I certainly did. So thanks very much, then, and bye for now. Bye bye.
