
Thank you for your comments! They are very insightful. To piggyback a bit:

Assuming you are a competent data "analyst" who wants to become a data engineer, how would you go about it? Is "go back to school and get a CS degree" the answer? I suppose this question is very broad, but I am curious if a practitioner like you has an opinion.

---

To give some context:

I recently graduated with a STEM PhD, and I am looking to move into data science. Reading the comments, I feel like I fall into the "pointless data scientist" cohort derided in this thread. E.g., I am very comfortable doing typical analytical work and occasionally training models inside a notebook, but I am neither a cutting-edge theoretical statistician nor a data engineer.

I've been trying to improve on the engineering side. For example, I did a project recently where I set up a rudimentary pipeline that continuously pings an API, uploads the data to a cloud database, then serves up the analysis via a Flask app. For me this was a big step up from just doing notebooks on a csv file :)

But moving beyond the basics, I am not sure what to study next. Hence my question. If you have any suggestions, I would greatly appreciate it!



We have colleagues with PhDs who are similar to you. What we ended up doing is building an internal machine learning platform to reduce the number of "taps on the shoulder". They had trouble setting up environments and dealing with libraries, system dependencies, etc. In addition to that, they relied on others to get data, fix their environment, deploy their models, or showcase their work to clients.

It wasn't optimal because we were having bottlenecks and variance: some people could move through the stack and do it all, but you either had them or you had to train them and it took time.

- [0]: https://iko.ai


I'm confused, do you want to become a data engineer or a data scientist? A data scientist is a type of senior data analyst. Data engineering is farther from data science than data analysis is.

I'm going to answer both questions just in case:

To become a data engineer / infrastructure engineer, there are multiple paths forward. I recommend doing BI work aka Business Intelligence Analyst Engineer. It's typically Tableau related work, so making dashboards and reports for management. It's still a data related role so you should feel comfortable and at home. However, it is also an engineering role. If you're the first BI at a company you'll often find yourself setting up an SQL server and doing certain data engineer-light type work to get data into the server. You'll need to set all of this up, so BI is a blend of data engineering and data analyst type work.

Once you've gotten familiar with BI work it's very easy to transfer to data engineer / infrastructure engineer type work. This is especially true if you end up setting up a data warehouse (as an alternative to MySQL) or data lake on AWS to do your BI work. You don't have to, but if you go that far, you're pretty much doing data engineering at that point. The line between the two is fuzzy. Data engineers and infrastructure engineers are expected to be architects, and by that I mean they are expected to future proof the schema of the SQL server / data warehouse (future proof setting up the database for new data so it doesn't become a mess). A BI is not expected to be an architect and imo the only way to gain that skill is through first hand experience playing with databases, so a BI is a good way to get that experience.

At my current company the infrastructure engineers are expected to do BI work. This is unusual, as data scientists typically do it (roughly 60% of data scientists do BI work). It came about because one of the data engineers I currently work with was a BI at his previous job. (He was on the sales team, helping them with more than just dashboards, like helping with their Excel spreadsheet algorithms and whatnot.)

I'm sure others could paint another path forward. Data engineers are highly in demand so it could be as simple as applying. If you can pass a whiteboard interview (leetcode style interview) you can skip the BI step and dive right in. Just like any technical white collar job, you're expected to self-learn what is required for the job before going in, so absolutely read guides / take classes / read books / etc on the topic to learn more.

----

To get a job as a data scientist:

BI work is a good bridge here too, not for learning database setup skills but for learning dashboard-building skills. Around 60% of data scientists in the industry do BI work. Me, I've had to create internal dashboards for diagnosing problems, which stream in live data in a visual way. This is not BI work, but there is clearly a bridge between BI dashboards and internal diagnostics dashboards.

Technically a data scientist is a kind of data analyst so many people go from data analyst directly to data scientist. Around 30% or so of data scientists do only data analyst work but have the data science title. (This 30% number is a bit of an estimate.) It's that strong of an overlap.

Data scientists tend to specialize. There are sales data scientists, marketing data scientists, engineering data scientists, ops data scientists, and so on. Oftentimes, but not always, they sit on the team they specialize in, instead of on a data science / data analyst team. Smaller companies tend to hire a single data scientist and expect them to do one kind of role. So it comes down to what kind of data science work you want to do. Sales data science roles tend to be BI heavy. Marketing data science roles tend to be data analyst heavy. Engineering data science roles tend to be the heavy model building roles that are the most challenging out of the bunch. Ops data scientists tend to specialize in malware detection and self-reporting. E.g., if someone is hacking the company's servers, they might get notified of an alert, and then they analyze it and report on it. There are other kinds of data scientists, like ones at supermarket companies and restaurants that specialize in forecasting warehouse items.

Me, I specialize in robotics and sensor analysis. I'm not going to lie, it's probably the hardest out of every kind of data science role. It's very heavy on the engineering side, not data engineering, but software engineering, because there is a lot of advanced feature engineering.

Most feature engineering is simple stuff like deleting missing values, taking the median over the dataset, or other kinds of cleaning and minor modifications like normalizing the data. Then it gets fed into an ML library that identifies the pattern in the data, so when new data comes in it can tell whether it recognizes that pattern. Each pattern is called a category and most ML work is categorization, so maybe you're categorizing different kinds of customers and if you can identify a pattern in their shopping habits, you might be able to predict what they will buy next.
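
To make the "simple stuff" concrete, here is a minimal sketch of that kind of cleaning in plain Python. The function names (`fill_with_median`, `clean_and_normalize`) and the choice of min-max scaling are my own assumptions for illustration, not anything specific to a particular library or to the workflow described above.

```python
from statistics import median


def fill_with_median(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    m = median(known)
    return [m if v is None else v for v in values]


def clean_and_normalize(rows):
    """Drop rows containing a missing value, then min-max scale each column to [0, 1]."""
    complete = [r for r in rows if all(v is not None for v in r)]
    if not complete:
        return []
    columns = list(zip(*complete))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no spread, so map it to 0.0 instead of dividing by zero.
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in complete
    ]


if __name__ == "__main__":
    print(fill_with_median([1, None, 3, 10]))
    print(clean_and_normalize([[0, 10], [None, 5], [4, 20]]))
```

In practice most people would do this with a dataframe library rather than by hand; the point is only that each step is a line or two.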

Advanced feature engineering might need to be used when your patterns are so complex ML can't pattern match it well, so you have to give it a helping hand and manually do some of the pattern matching. I've also had to invent new forms of ML too, but it's been a while since I've had to go that far. What I do is the farthest from normal for data science.
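
As a rough illustration of "giving it a helping hand", here is a sketch that turns a raw sensor signal into hand-crafted summary features per window, so the model sees the shape of the signal rather than individual readings. The function name `window_features` and the three features chosen (mean, range, count of crossings through the window mean) are hypothetical examples, not the actual techniques used in the work described above.

```python
def window_features(signal, size):
    """Summarize each non-overlapping window of `size` samples as
    (mean, range, number of times the signal crosses its own window mean)."""
    features = []
    for start in range(0, len(signal) - size + 1, size):
        window = signal[start:start + size]
        mean = sum(window) / size
        spread = max(window) - min(window)
        centered = [v - mean for v in window]
        # A sign change between neighbouring centered samples is one crossing.
        crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
        features.append((mean, spread, crossings))
    return features


if __name__ == "__main__":
    # An oscillating stretch followed by a flat stretch.
    print(window_features([0, 2, 0, 2, 5, 5, 5, 5], size=4))
```

The oscillating and flat windows end up with very different feature tuples even though a generic model looking at raw samples might struggle to separate them.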

Most data scientists do not know advanced feature engineering, but it's one of the bridges between software engineering and data science, so leveling up software engineering can help on that front. (Which is also why I bring it up.)

A data scientist shouldn't be expected to know much, if any, data engineering. Instead, gaining managing-up skills helps. Knowing how to write a SQL query to get data and how to write a join is enough. You should do fine, just try learning data science itself instead of learning data engineering, unless you're curious. (A lot of universities and bootcamps teach machine learning engineer skills and call it data science. If it doesn't have data cleaning and feature engineering, it's probably not data science. Likewise if it has TensorFlow or PyTorch in it, it's not data science.)
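
For a sense of how little SQL that is, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables, their columns, and the sample rows are made up for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            item TEXT,
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES
            (1, 1, 'keyboard', 49.5),
            (2, 2, 'monitor', 199.0),
            (3, 1, 'mouse', 19.0);
    """)
    return conn


def orders_with_customer_names(conn):
    """Join each order to the customer who placed it."""
    query = """
        SELECT c.name, o.item, o.amount
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        ORDER BY o.id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in orders_with_customer_names(build_demo_db()):
        print(row)
```

If you can read and write a query like the one in `orders_with_customer_names`, you have roughly the level of SQL I mean.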


Thank you for the detailed response. It makes things a lot clearer.

I notice that you mention "Machine Learning Engineer" as a separate role. If in the idealized world, data scientists do analytics and train models, and data engineers take care of data, then what do Machine Learning Engineers do? Are they, basically, software engineers who specialize in putting other people's models into production?

And you are right in sensing my confusion. There seems to be an abundance of data-related titles, which seem to overlap in their functions a lot, but are also very different when you examine them closely. So thank you again for your responses, they are very helpful.


>Are they, basically, software engineers who specialize in putting other people's models into production?

It depends on the company. Traditionally, yes, but deployment into production can be automated, so typically today it is something different.

An MLE is someone who specializes in TensorFlow or PyTorch. They write deep neural networks, reinforcement learning, and more. Oftentimes the data scientist will make a model, specializing in feature engineering and domain experience, and use a generic ML method like a generic DNN or xgboost or whatever it may be. It then gets handed off to an MLE who writes ML specific to the problem to get every last drop of accuracy out of the model. They then hand it off to prod. I don't think they're on call (I could be wrong on this), so today they're not really deploying models much. They're more of an in-between.

I work at small companies and startups so I've never worked with an MLE, but I do have friends who are managers at Google who told me about it, so that's where this information is coming from, telephone game. In other words, I'd take this with a grain of salt. ymmv.

Starting in 2018 big name companies couldn't get enough MLEs, and MLEs are paid more than DSs, but many bootcamps and universities center around ML skills, so companies started renaming MLE positions to DS positions. This way they get more applicants and can pay them less. Win-win for them. Too bad it messes up the industry. Today about 1 in 3 data science jobs is ML heavy. They may be MLE-exclusive, or hybrid multiple-hats jobs that range from light DS to light MLE.

You can identify which is which by whether they give you a whiteboard coding problem. Traditional data science work will never have a whiteboard problem.

>So thank you again for your responses, they are very helpful.

You're very welcome. I hope it helps.



