Contributing to Open Source SQLFluff During my Internship at Bluesky

Yilang He

Meet Yilang He, Software Engineer Intern 

Yilang is a software engineer intern at Bluesky. During his first month, he started on an open-source project, SQLFluff, with the goal of improving its support for Snowflake. SQLFluff is a dialect-flexible and configurable SQL linter. Designed with ELT applications in mind, SQLFluff also works with Jinja templating and dbt. SQLFluff will auto-fix most linting errors, allowing you to focus your time on what matters most. Although this was a large and unfamiliar codebase for Yilang, he hit the ground running and learned quickly. Yilang’s first patch has already been merged into the upstream SQLFluff project.

Bluesky leverages a lot of open-source software. This is Bluesky’s first contribution back to the open-source community. Going forward, we will keep looking for opportunities like this to contribute.

Tell us about yourself

My name is Yilang, and I’m a senior at UC San Diego majoring in computer engineering. I grew up in China. When I started at UC San Diego, I didn’t have much experience with coding or computer science. But I took a CS 101 class and was fascinated by the power of coding, and now I’m interning at Bluesky as a software engineer.

How did you learn about Bluesky?

I first learned about Bluesky at the Greylock career fair in summer ’22. Greylock led the $8.8M seed investment in Bluesky. As I read more about the company’s mission, which is to make data efficient and easy to use, and about the founders, Mingsheng and Zheng, I realized it was an opportunity that I couldn’t pass up.

Working on SQLFluff

I’m working on an open-source project called SQLFluff. My primary job is to debug and fix the project’s Snowflake-related issues so that it is more compatible with SQL queries written in the Snowflake dialect.

This internship allows me to learn about Bluesky, a startup with an audacious mission. It also allows me to become part of the SQLFluff community and build valuable open-source software. An open-source project is a lot like a project at any company you might work at: there is a house coding style, a team culture, and workflows for getting things done. The difference is that an open-source project usually has a much more varied group of people working on it. I’m simultaneously working with and learning from both the Bluesky team and the open-source community, which is really awesome.

What is the motivation behind SQLFluff?

Fixing SQLFluff’s Snowflake-related issues makes it more compatible with SQL queries written in the Snowflake dialect. We can then reformat those queries efficiently without worrying about parsing errors. Additionally, well-formatted SQL code will be beneficial for the query optimization that we are trying to achieve at Bluesky.

Another motivation is that SQLFluff can be used in Bluesky’s pipeline, so we will learn from putting it into practice there. And as a responsible user of open-source software, we want to contribute back to the SQLFluff community.

What was challenging about the SQLFluff project? 

The first challenge I encountered was setting up the environment. We wanted to make sure that we were debugging our checkout of the source code rather than the copy of the package installed into the Python library directory. Another challenge was the size of the code base: it is hard to understand how the pieces are implemented and how they relate to one another. Here I would like to give a shout-out to my mentor, Bluesky co-founder Zheng Shao, who believes in learning by doing. Zheng also worked at Uber, where they had an official engineer-to-engineer mentoring program, and he knows firsthand the importance of mentoring for building strong teams, especially with junior engineers. Zheng frequently reviewed my code and offered really instructive guidance on environment setup and debugging. This helped me produce quality code and accelerated my professional development.
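To illustrate the environment challenge, here is a minimal sketch of one way to check which copy of a package Python would import. It is not the setup we actually used, and the helper names `module_origin` and `is_installed_copy` are made up for this example. It assumes that an installed copy lives under a `site-packages` or `dist-packages` directory, while a development checkout (for example one installed with `pip install -e .`) resolves to a path inside the source tree.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(name: str) -> Optional[Path]:
    """Return the file a top-level module would be imported from, or None.

    Hypothetical helper for illustration; it only looks the module up
    and does not import it.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


def is_installed_copy(path: Path) -> bool:
    """Assumption: installed packages sit under site-packages or dist-packages."""
    return any(part in ("site-packages", "dist-packages") for part in path.parts)


if __name__ == "__main__":
    origin = module_origin("sqlfluff")
    if origin is None:
        print("sqlfluff is not importable in this environment")
    elif is_installed_copy(origin):
        print(f"Importing the installed copy: {origin}")
    else:
        print(f"Importing from a source checkout: {origin}")
```

If the script reports the installed copy, breakpoints set in the source checkout will never be hit, which is exactly the situation we wanted to avoid.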

What skills did you learn while working on the project?

SQLFluff was my very first experience contributing to an open-source project, and it provided an invaluable learning experience. A few key lessons include learning how to understand the code base, finding the root cause of an issue, communicating with other contributors and more senior engineers, and submitting pull requests to solve issues. Debugging was probably the most important skill I learned. With a large code base and complex relationships between its parts, it is fairly hard to understand and debug the code just by reading it. Using a debugger is therefore an important skill for analyzing the code base.

What are the next steps of the project?

As a next step, we will start testing SQLFluff on Snowflake queries from our clients and further improve its compatibility with the Snowflake dialect. This could help Bluesky parse our customers’ queries into the desired format, making it more convenient to analyze their cost and efficiency.
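A test like this could be sketched roughly as follows. This is an illustrative sketch, not Bluesky’s actual pipeline. The function `find_unparsable` and the sample queries are hypothetical. It assumes SQLFluff’s simple Python API, where, as far as I understand it, `sqlfluff.lint(sql, dialect="snowflake")` returns a list of dictionaries and parsing failures carry the code `PRS`. Please check this against the SQLFluff version you have installed.

```python
from typing import Callable, Dict, List, Optional

Violation = Dict[str, object]
LintFn = Callable[[str], List[Violation]]


def _sqlfluff_snowflake_lint(sql: str) -> List[Violation]:
    """Default linter. Assumes the third-party sqlfluff package is installed."""
    import sqlfluff  # imported lazily so the rest of the sketch runs without it

    return sqlfluff.lint(sql, dialect="snowflake")


def find_unparsable(
    queries: Dict[str, str], lint: Optional[LintFn] = None
) -> Dict[str, List[Violation]]:
    """Map each query name to its parsing violations, skipping clean queries.

    Assumption: a parsing failure is reported with the code "PRS".
    """
    lint = lint or _sqlfluff_snowflake_lint
    failures: Dict[str, List[Violation]] = {}
    for name, sql in queries.items():
        parse_errors = [v for v in lint(sql) if v.get("code") == "PRS"]
        if parse_errors:
            failures[name] = parse_errors
    return failures


if __name__ == "__main__":
    # Made-up sample queries; real ones would come from query history.
    sample = {
        "simple_select": "SELECT id FROM orders\n",
        "qualify_clause": "SELECT id FROM orders QUALIFY ROW_NUMBER() OVER (ORDER BY id) = 1\n",
    }
    for name, errors in find_unparsable(sample).items():
        print(name, errors)
```

Passing the linter in as a parameter keeps the loop easy to test with a stand-in function, and each query that comes back with parsing errors is a candidate for a new dialect fix.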

How would you describe the Engineering team at Bluesky?

The Bluesky eng team is intelligent and driven. Even though we are a remote team, there is open communication and people are very responsive. Everyone is willing to pitch in to get the job done. There’s no ego, and everyone works collaboratively.

If you are passionate about big data and machine learning and interested in joining the team, Bluesky is hiring! Check out job openings here