Subscribe to data newsletters
Before jumping into popular MOOCs or purchasing recommended books on Amazon, I started by subscribing to various data science and data engineering newsletters. At first, I read every single article and took notes, but over time I learned to recognize the important links shared across multiple newsletters and to focus on just a few. Newsletters are a great way to stay up to date with new tools, academic research, and popular blog posts from large internet companies (e.g. Google, Netflix, Spotify, Airbnb, Uber).
Here are some of my favorite newsletters:
- Tristan Handy’s Data Science Roundup: Tristan provides his own commentary on his curated list of data science articles.
- Data Science Weekly: A curated list of data science, AI, and ML-related articles and blog posts. I also find the Training & Resources sections to be a useful collection of online tutorials.
- Hacker Newsletter: A weekly newsletter featuring hand-picked articles from Hacker News. It’s not specific to data science or data engineering, but it has dedicated sections on data and code that are relevant.
- AI Weekly from VB: Thoughts from writers at Venture Beat with a collection of articles related to AI.
I also subscribe to Data Machina, The Analytics Dispatch, and AI Weekly.
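If you would rather not spot the recurring links by eye, the idea of focusing on links shared across multiple newsletters can be roughly sketched in a few lines of Python. This is only an illustration: the newsletter names and URLs below are made-up sample data, `normalize` and `recurring_links` are hypothetical helpers, and you would still need to collect the links from each issue yourself.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Drop scheme, query string, fragment, and trailing slash so that
    tracking variants of the same link compare equal."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def recurring_links(issues: dict[str, list[str]], min_newsletters: int = 2):
    """Return (link, count) pairs for links found in at least
    `min_newsletters` different newsletters, most shared first."""
    counts = Counter()
    for links in issues.values():
        # A set ensures a link repeated inside one issue is counted once.
        counts.update({normalize(link) for link in links})
    return [(link, n) for link, n in counts.most_common() if n >= min_newsletters]


# Hypothetical sample data: one list of links per newsletter issue.
issues = {
    "newsletter_a": ["https://example.com/post-1?utm_source=a", "https://example.com/post-2"],
    "newsletter_b": ["https://example.com/post-1/", "https://example.org/paper"],
    "newsletter_c": ["http://example.com/post-1", "https://example.org/paper#abstract"],
}

print(recurring_links(issues))
```

Stripping the query string is a deliberate simplification, since newsletters usually append tracking parameters; it would wrongly merge pages that are distinguished only by their query string.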
Craft your own data curriculum
Next, depending on your focus, you need to craft your own data science, data engineering, or data analytics curriculum. This may include learning to program in Python or R if you are switching careers from a non-programming role. If budget is not a concern, joining a bootcamp or taking courses from Udacity and Dataquest may be a great way to get online mentorship from industry experts. However, if you are price-conscious like I was, you can instead follow open-source guides to assemble a free curriculum.
One caveat here is that simply taking these courses is not enough. I found that most online courses and tutorials focus either on foundational knowledge (e.g. math, statistics, theory) or on simplified guides that walk through a trivial example. This is especially true in big data, since tutorials tend to run locally on a small subset of the data instead of walking through a full production setup in the cloud.
To supplement the theory with realistic scenarios, I suggest joining Kaggle and using Google’s free tools such as Colab to practice working with large datasets. You can also search for GitHub repos from Udacity students to see what a capstone project might look like.
Network with experts for free
Any career guide would tell you that networking is important. But how does one go about finding industry experts willing to mentor or simply answer some questions? Prior to the pandemic, one option was to attend meetups, but that opportunity was largely limited to residents in major tech hubs like the Bay Area, New York, or Seattle (at least in the US). The other option was to attend conferences or workshops focused on data science, machine learning, or data engineering. However, the tickets for these events were very expensive, making it impractical for individuals to attend without company sponsorships.
As a startup employee living in Baltimore, my solution was to network online: I watched free videos of sessions held by industry partners at tech conferences (e.g. AWS re:Invent, Microsoft Ignite, or Google Cloud Next) and then connected with the speakers on LinkedIn. Aside from the keynotes and the sessions on new cloud product releases, there are plenty of sessions on best practices and architecture where a product manager or a lead developer from an industry partner (e.g. Lyft, Capital One, Comcast) presents with a solutions architect from AWS, Azure, or GCP on solving a real problem at scale. I would take notes on the session and then reach out to all the speakers on LinkedIn with a question about their product or an architectural decision mentioned in the talk. Surprisingly, almost all the speakers were willing to respond and continue the conversation with me, even though I was just a recent grad working at an unknown startup at the time.
Over time, I steadily grew my network this way and had the added benefit of staying up to date with new products and industry trends across all the major cloud providers. Considering the current situation with COVID-19 and the continued shift towards virtual events, this may become the new norm in networking instead of attending conferences to meet other stakeholders in person.
Get certified
While cloud certifications are by no means a validation of ability or data knowledge, I still think there’s value in investing in them. This is especially true if you are aiming to be a data engineer, as cloud knowledge is imperative for running production workloads. Even for data scientists, becoming familiar with cloud products lets you focus on actually analyzing the data instead of struggling to load and clean it at scale.
Another underrated aspect of getting certified is the network it opens up. There are very active members on LinkedIn, particularly in tech consulting, who post about new opportunities in cloud data positions. Some recruiters post directly in LinkedIn groups that are open to certification holders only. A certification alone won’t lead to a new job or position, but having those badges makes it easier to start a conversation with peers or recruiters. Personally, I landed a few small consulting projects after acquiring the certifications.
Solve real problems
Finally, as with any engineering discipline, you will only improve with practice. If you are already working as a data scientist or data engineer, getting real-world experience should not be an issue. For others looking to transition, many will recommend building a portfolio. But where do you start? Working with the classic Titanic dataset for survival classification or clustering the iris dataset is more likely to hurt your portfolio than help it.
Instead, try to use public GitHub projects as inspiration. Drawing on the network you built on LinkedIn through tech sessions and certifications, look at what others are building. Feel free to use examples from Udacity or Coursera projects on GitHub. Then mix in real datasets from Google Research or Kaggle, or search for another interesting dataset, and start building solutions for real problems.
If you are interested in a sector or a specific company, try to search for public datasets and build a sample project. For example, if you are interested in fintech, try using Lending Club’s public loan data to build a loan approval algorithm. The biggest takeaway from working with real datasets is that they are far messier and noisier than the ones provided in academic settings.
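To give a feel for what that messiness looks like, here is a minimal sketch of the cleaning step that usually comes before any modeling. It is not a loan approval algorithm, only the groundwork for one. The rows are invented for illustration, and the column names (`loan_amnt`, `int_rate`, `term`, `annual_inc`, `grade`, `loan_status`) are assumptions rather than a guaranteed match for Lending Club's actual schema, so check them against the file you download.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows showing typical problems: percent signs,
# thousands separators, stray whitespace, blanks, and mixed case.
RAW = """loan_amnt,int_rate,term,annual_inc,grade,loan_status
10000, 13.5%, 36 months,55000,B,Fully Paid
"12,000",n/a, 60 months,,C,Charged Off
8000,7.9%,36 months,72000,a,Fully Paid
15000,18.2%, 60 months,48000,C,Charged Off
5000,9.1%,36 months,61000,B,Current
"""


def to_float(text):
    """Parse values like ' 13.5%' or '12,000'; return None if missing."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    if cleaned.lower() in {"", "n/a", "na", "null"}:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean_rows(raw):
    """Keep only loans with a final outcome and normalize each field."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw)):
        status = rec["loan_status"].strip()
        if status not in {"Fully Paid", "Charged Off"}:
            continue  # outcome not known yet, so it cannot be used as a label
        rows.append({
            "loan_amnt": to_float(rec["loan_amnt"]),
            "int_rate": to_float(rec["int_rate"]),
            "term_months": int(rec["term"].split()[0]),
            "annual_inc": to_float(rec["annual_inc"]),
            "grade": rec["grade"].strip().upper(),
            "defaulted": status == "Charged Off",
        })
    return rows


def default_rate_by_grade(rows):
    """A simple sanity check: share of defaulted loans per grade."""
    totals = defaultdict(lambda: [0, 0])  # grade -> [defaults, loans]
    for row in rows:
        totals[row["grade"]][0] += row["defaulted"]
        totals[row["grade"]][1] += 1
    return {grade: d / n for grade, (d, n) in sorted(totals.items())}


rows = clean_rows(RAW)
print(default_rate_by_grade(rows))
```

Missing values are kept as `None` rather than filled in, because how to impute them (or whether to drop those rows) is a modeling decision worth making explicitly and explaining in your portfolio write-up.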