Retrospection in 2019

I know it is a little late to reflect on what I did in 2019, but late is better than None. Many things happened in 2019. One of the biggest events was that I graduated from Carnegie Mellon and started the first full-time job of my life. I was lucky to start my career with my dream job in data science. In reality, data science turned out to be more complex than I expected. The advantage of working at a startup is that there are a lot of opportunities around data. I will say more about that later.

Learning path in 2019

From January to May, I was still enrolled in my master's program and had the chance to take elective courses. The main focus was on big data, including NoSQL Database Management and big data analytics with PySpark. I also followed the Big Data Specialization offered by UCSD on Coursera. I finished five of the six courses, with one capstone project still to finish.

I am proud to say that I finished the Deep Learning course. It is the most difficult course I have ever taken, and I am still glad that I stuck with it to the end. I used PyTorch to build models for problems in computer vision and transcript translation. The last assignment took me about two weeks to complete because it required us to build an attention model from scratch. One silly bug held me up for a long time, and I finally figured it out with help from the TAs and my friends. The lesson I took from 11785 is that learning as a team is very important. My friends and I formed a study group and met every week to discuss the concepts from class and some key points in the assignments. We did not copy and paste lines of code, which was certainly not allowed, but we did help each other out when we got stuck on ideas or misunderstandings. I am also grateful for the help from the TAs. They were patient and responsive with every question and bug, and they were very smart. One undergrad TA helped me find a bug that was very hard to find. Although I cannot say I have applied 100% of that knowledge in practice, I still think it was a great journey, even if it made my life harder. I believe it will pay off some day.

In July, I started to work, and work is a big part of learning.

My heart is in the work

Well, when I started working, I realized that my heart could not be only in the work. Yes, there are so many other fun things I am eager to pursue. However, when I am working, I have to put 100% of my attention into it.

All about pipelines

Only now have I realized that I actually work more like an engineer. Three months after onboarding, I felt that I was not a typical data scientist, and I asked to change my title to Data Scientist (previously Data Engineer). Recently, however, I have come to see myself as a machine learning engineer or a data engineer.

End-to-end model building

The first project was to build a pipeline covering an end-to-end machine learning process: model training, model evaluation, and model logging. I was very excited about it because it was meant to build not just one or two models but thousands or even millions, as many as we wanted, with the help of distributed computing. Dask, Prefect, and MLflow were totally new tools for me, and I had to integrate them into one pipeline. It took me about two months to get it done. Sorry, not completely done. One lesson I learned on the job is that no task is ever 100% complete. Everything can be improved every day. My skills and vision keep growing, and every time I look back at my code, I ask myself why I wrote that piece of s***. Then I realized how important a good code structure is. A bad structure makes code hard to revisit and refactor, while a good one is clear and flexible enough to take new functions. I did not have that experience when I started, but that is totally okay. I will keep doing my best to improve it.
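To make the shape of such a pipeline concrete, here is a toy sketch of the train, evaluate, and log steps run for many models in parallel. It is not the real pipeline: a thread pool from the standard library stands in for Dask, a plain list of dictionaries stands in for MLflow logging, the "model" is just the mean of the training values, and the dataset names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def train(values):
    """'Train' a toy model: always predict the mean of the training values."""
    return mean(values)


def evaluate(model, values):
    """Mean absolute error of the constant prediction on held-out values."""
    return mean(abs(v - model) for v in values)


def run_one(name, train_values, test_values):
    """Train, evaluate, and return a log record for a single model."""
    model = train(train_values)
    return {"name": name, "model": model, "mae": evaluate(model, test_values)}


def run_all(datasets, max_workers=4):
    """Run the three steps for every dataset in parallel and collect the logs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_one, name, train_values, test_values)
            for name, (train_values, test_values) in datasets.items()
        ]
        # Collect results in submission order so the log is deterministic.
        return [future.result() for future in futures]


# Hypothetical datasets: name -> (training values, test values).
datasets = {
    "client_a": ([1.0, 2.0, 3.0], [2.0, 4.0]),
    "client_b": ([10.0, 10.0], [10.0, 12.0]),
}
run_log = run_all(datasets)
for record in run_log:
    print(record)
```

Keeping each step a small function of its inputs is what makes it possible to swap the thread pool for a distributed scheduler later without touching the training code.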

Chain all the notebooks

The second project was to write a pipeline that runs the feature engineering notebooks. On a data science team, everyone has their own contributions and strengths. I did not work on feature generation, but I used tools to make the whole process run more efficiently. Luckily, I learned Papermill here. I never knew that a notebook could be run with just one line of code from another notebook, or even from the command line.
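As an illustration of chaining notebooks, here is a minimal sketch built around `papermill.execute_notebook(input_path, output_path, parameters=...)`. The helper name `run_notebooks`, the `_output` file suffix, and the `execute` argument are assumptions made for this example, not part of Papermill or of the original pipeline.

```python
from pathlib import Path


def run_notebooks(notebooks, output_dir, parameters=None, execute=None):
    """Run notebooks one after another and return the executed copies' paths.

    `execute` defaults to papermill.execute_notebook; passing another callable
    with the same arguments makes the helper easy to try without Papermill.
    """
    if execute is None:
        # Imported lazily so Papermill is only required when it is actually used.
        import papermill as pm

        execute = pm.execute_notebook

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    outputs = []
    for notebook in notebooks:
        # Hypothetical naming scheme: features.ipynb -> features_output.ipynb
        out_path = out_dir / f"{Path(notebook).stem}_output.ipynb"
        # An exception raised here propagates, so later notebooks are not run.
        execute(str(notebook), str(out_path), parameters=parameters or {})
        outputs.append(out_path)
    return outputs
```

With Papermill installed, a call such as `run_notebooks(["clean.ipynb", "features.ipynb"], "runs", {"run_date": "2019-12-31"})` would execute both notebooks in order (the file names and the parameter are placeholders). The command-line equivalent for a single notebook is `papermill clean.ipynb runs/clean_output.ipynb -p run_date 2019-12-31`.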

Typical data flow

The third big one was to handle data transfer, data transformation, and data validation. I used Embulk to transfer data from a MySQL database to Redshift on a nightly basis. I also worked with a coworker on data transformation, using dbt to bring data from different clients into one standardized format. Last, we used Great Expectations to validate that the data is right. Besides that, I developed a library to check that each data transfer is correct. For example, does the row count match before and after the transfer? Do the values match for the same column? Quality control upstream is essential because all the feature building and models rely on it. Boredom was inevitable at first, but once I understood where this work sits in the entire process, I stayed focused on the details and never let anything go wrong.
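The two checks mentioned above, matching row counts and matching column values, can be sketched in a few lines. This is only an illustration, not the library described here: two in-memory SQLite databases stand in for MySQL and Redshift, and the function name `compare_transfer`, the `orders` table, and its columns are invented for the example.

```python
import sqlite3


def compare_transfer(source, target, table, key, columns):
    """Return a list of mismatches between two copies of the same table.

    Table and column names are interpolated into SQL, so they must come
    from trusted configuration, never from user input.
    """
    problems = []

    # Check 1: the row count must match before and after the transfer.
    count_sql = f"SELECT COUNT(*) FROM {table}"
    src_count = source.execute(count_sql).fetchone()[0]
    tgt_count = target.execute(count_sql).fetchone()[0]
    if src_count != tgt_count:
        problems.append(f"row count: source={src_count}, target={tgt_count}")

    # Check 2: each column must hold the same values, compared in key order.
    for column in columns:
        value_sql = f"SELECT {key}, {column} FROM {table} ORDER BY {key}"
        if source.execute(value_sql).fetchall() != target.execute(value_sql).fetchall():
            problems.append(f"column {column}: values differ")
    return problems


def make_db(rows):
    """Build an in-memory database holding a small hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


source = make_db([(1, 10.0), (2, 25.5), (3, 7.25)])
good_copy = make_db([(1, 10.0), (2, 25.5), (3, 7.25)])
bad_copy = make_db([(1, 10.0), (2, 99.0)])

print(compare_transfer(source, good_copy, "orders", "id", ["amount"]))  # → []
print(compare_transfer(source, bad_copy, "orders", "id", ["amount"]))
```

Pulling every row is fine for a toy table. On real nightly loads, comparing aggregates such as counts, sums, or minimum and maximum values per column would probably be a cheaper first pass.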

A little more in the work

It is hard to describe what I gained from work in just one paragraph. The knowledge I learned in college was helpful and great, but to be honest it felt dead. Working on industry projects is a golden chance to see the entire data flow. When I learned about NoSQL databases, I could not understand why traditional databases were limited or why the CAP theorem was so important. Once I saw the whole picture of how our data moves from clients to us, everything made sense. The knowledge is alive now. It happens every day.

Because my work requires so much programming, I am forcing myself to become more professional at coding. On a team, code is written not only for myself but also for everyone else. Readability, reusability, and efficiency are the things I need to care about most. Right now our projects are somewhat independent, so coworkers do not really need to build libraries on top of mine. However, chances are they will need to understand what they can do with my library, and then documentation and tutorials help them. It is a good habit to comment code and write docstrings all the time. As the team grows, one day others might have to read my code and refactor it, and that habit will save them a lot of time.

Other things I did not get from school are version control and teamwork. I had used Git before for my own projects, but I did not need to care about others since I was the only contributor. In reality, Git means a lot, especially when you have teammates working on the same project. Sadly, I am still confused by Git. It is so tricky. But I have already set a goal to nail it in 2020. Version control helps not only with code but also with model management. MLflow is a great tool for keeping track of models so that you know whether you have improved over time.

Scheduling, bash scripts, inheritance, decorators, and many other cool things came to me last year. They really broadened my horizons.

How I see the career transition

I would say most of my work focuses on machine learning engineering or data engineering, not really data science. However, it does no harm to call myself a data scientist. Actually, data science is so broad that no one can confine it to a single field. It is a highly hybrid subject that involves computer science, business, and statistics. Before taking on these projects, I thought my job would be heavily involved with modeling and explaining. It turned out to be about using advanced tools to help data scientists build models at scale and developing a standard, stable infrastructure for data pipelines. I have been enjoying what I do. This work on infrastructure and pipelines is not something I could have learned at school, so it was hard to say no.

I did think over my career transition, from a data scientist to a data science engineer. Analysis is really not my thing, to be honest. What I am really fond of is deploying the analysis, automating it, and making it scale.

It is easy to lose passion. To be honest, I am not always passionate about work. At some points, I felt frustrated, depressed, and sad. However, everyone should allow themselves to feel that way. It is okay not to be a master now, but I should always know what I need to learn. A good way to keep my passion is to stay aware of the unlimited knowledge that I have sworn to pursue for my entire life.

Things in 2019

I know I have said a lot about 2019. In this part, I just want to list what I did:

  1. finished the deep learning specialization with sequence modeling
  2. learned PySpark in a nutshell (by course)
  3. wrote libraries in Python
  4. finished five courses on big data (Coursera)

Expectations for 2020

I have some expectations with solid goals. First, I want to become more fluent in Python. On the engineering side:

  1. finish Fluent Python
  2. master version control (Git)
  3. master Linux/bash commands
  4. master CI/CD, Docker, containers
  5. master multithreading, distributed systems
  6. master data structures and algorithms (practical)

I am still into the big data area. I have been using Dask a lot so far, and I really like it. However, I think Spark is still the most popular framework when it comes to big data.

  1. Hadoop
  2. Spark with Python and Scala
  3. Designing Data-Intensive Applications
  4. Kafka

Data/Machine Learning Engineer Full path

  1. data modeling (SQL, database design) - Udacity
  2. big data (Spark, Hadoop) - Udemy, LinkedIn
  3. advanced programming (book, LinkedIn, LeetCode, etc.)
  4. machine learning topics (NLP, recommender systems)
  5. Docker
  6. Linux
  7. Scala (with Spark)
  8. Airflow

One way to learn is to teach. I cannot say I can teach, but I can write posts. I hope to write more posts this year. I paused for too long. One reason is that, admittedly, I used to write posts to showcase my efforts and tell employers that I was qualified. After getting the offer and starting work, I put all my time into the job and wrote a lot in the workplace. The other reason is that my personal computer kept getting slower, and I hesitated to turn it on.

Today I started writing, and it felt good. It is worth looking back to see how many things I have done and how much time I have used or wasted. Lastly, procrastination is an excuse to avoid failing at perfectionism, but there is no perfection at all. I am so glad I got this done.

To conclude this post, I will quote one of my tweets from the other day: “What kind of person do I desire to be in 10 years? I am amazed by the achievements and hard work that professionals have made in the past ten years. And so will I.”

Cheers!