Who Is a Data Engineer?

Adrian Więch, 28th Jul 2020, 10 minutes read
Tags: data engineering, Jobs and Career

Table of Contents
Data Engineer Responsibilities
How to Become a Data Engineer
Skillset
  Programming
  Automation and Scripting
  Databases
  Big Data Tools
  Cloud Computing
Summary

A new kind of job has recently emerged in the IT world: Data Engineer. At first sight, it may seem very similar to Data Analyst or Data Scientist positions. However, this article explains the important differences. We present the skills, tools, and everyday tasks of Data Engineers. We also explain how you can get started with this career path.

Thirty years ago, we typically used terms such as “Computer Scientist” when referring to anyone working with computers. A lot has changed since then, and the IT industry has seen the appearance of countless new positions. We now have Developers, System Administrators, IT support employees, security experts, and many more. In recent years, the title “Data Engineer” was also coined. Let’s try to explain what this position is about.

Data Engineer Responsibilities

The IT industry is generating more and more data every year: information about Uber rides, Netflix subscribers, Google search hits… All of these are perfect examples of big data. The amount of data created by these popular companies is so large that they now need employees whose only job is to help collect and process it.

Data Engineers create and maintain the infrastructure necessary to store and process large amounts of data. They identify the kind of data that can be acquired and make sure that the collection process meets the business requirements and industry standards. Data Engineers create data pipelines and flows, and they make sure that large amounts of information are processed correctly and as efficiently as possible.
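As a rough illustration of what such a pipeline can look like, here is a minimal sketch in Python. The ride records, the field names, and the stage functions (`extract`, `clean`, `total_fare_by_city`, `run_pipeline`) are all hypothetical and invented for this example. A production pipeline would read from a real data source and would typically rely on dedicated big data tools.

```python
from collections import defaultdict

# Hypothetical raw events, as they might arrive from a ride-hailing app.
RAW_EVENTS = [
    {"ride_id": 1, "city": "Warsaw", "fare": "12.50"},
    {"ride_id": 2, "city": "Krakow", "fare": "8.00"},
    {"ride_id": 3, "city": "Warsaw", "fare": None},  # incomplete record
    {"ride_id": 4, "city": "Warsaw", "fare": "7.25"},
]


def extract(events):
    """Yield raw events one at a time (stands in for reading a stream)."""
    for event in events:
        yield event


def clean(events):
    """Drop incomplete records and convert fares from text to numbers."""
    for event in events:
        if event["fare"] is None:
            continue
        yield {**event, "fare": float(event["fare"])}


def total_fare_by_city(events):
    """Aggregate cleaned events into a total fare per city."""
    totals = defaultdict(float)
    for event in events:
        totals[event["city"]] += event["fare"]
    return dict(totals)


def run_pipeline(events):
    """Chain the three stages: extract, then clean, then aggregate."""
    return total_fare_by_city(clean(extract(events)))


if __name__ == "__main__":
    print(run_pipeline(RAW_EVENTS))
```

Each stage is a generator that passes records on one at a time, so the whole data set never has to fit in memory at once. This is the same idea, at a much smaller scale, that real pipelines use when they process large volumes of data.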
Data Engineers also create database structures to store the data they collect, and they make sure that all the information is persisted securely.

For example, Data Engineers at Uber will have to figure out how to handle the information stream coming from drivers’ and riders’ smartphones. Since there are thousands of people taking rides at the same time, Data Engineers will need to use efficient tools and algorithms to make sure that no data is lost or delayed in the process. Uber databases will have to be set up properly so that they don’t choke when huge amounts of data are saved. The raw stream of information stored in these databases is then typically processed to get more meaningful data that various Uber departments (such as marketing or business development) can work on. This processing of raw data is also a task for Data Engineers.

Data Engineers often work closely with Data Scientists. The former are responsible for acquiring and maintaining data with big data tools, while the latter focus on analyzing the data to derive meaningful business insights. In other words, a Data Engineer must first acquire and store the data in an efficient way before a Data Scientist can interpret it. In the Uber example, a Data Scientist will probably retrieve the riders’ data from the database once a Data Engineer has prepared it. The Data Scientist will then use various statistical models to find answers to key business questions, such as “What are the most popular Uber routes in a given city (based on the GPS points)?” or “Which Uber app screens are used most (based on the screen transition times)?”

How to Become a Data Engineer

Becoming a Data Engineer isn’t as straightforward as becoming a Developer or Database Administrator. Data Engineering is a multidisciplinary field that has emerged recently, and universities around the world typically don’t offer a degree in Data Engineering.
Related degree options for aspiring Data Engineers include Computer Science, Data Science, Analytics, or Mathematics. However, Data Engineers are usually shaped by experience rather than taught at universities. Chances are that your first job will be something else. You may start as a Software Engineer, Analyst, or Data Scientist and then learn the missing concepts to eventually become a Data Engineer. At smaller companies, where a single person is responsible for a number of different roles, you may perform Data Engineering tasks without being explicitly called a Data Engineer.

Skillset

As we mentioned before, Data Engineering is a multidisciplinary field. That is why you will typically need to acquire more than one technical skill to do your job. Below, we present the five most important areas of technical expertise that are helpful for Data Engineers.

Programming

Data Engineers inevitably need to acquire some programming skills. The most popular languages in this field are Python and Java/Scala.

Python is a modern programming language that is easy to read and learn. Compared to languages such as C++ or Java, you typically write less code to get the same results. Python offers a huge number of helpful built-in functions and shortcuts, which is why it allows you to write software quickly and efficiently. It is therefore the favorite programming language of many IT specialists, such as Data Scientists. As a Data Engineer, you’ll be cooperating closely with these specialists, so it makes perfect sense to learn Python. If you don’t know where to start, you can pick one of the interactive Python courses from LearnPython, such as “Python Basics. Part 1”.

You should also take a look at Java or Scala. Both of these languages run on the Java Virtual Machine. Java has been around for decades and is one of the most popular programming languages in the world. Scala, on the other hand, is a more modern choice with stronger support for functional programming.
Java and Scala are both natural choices for Data Engineers because many tools used in the industry are written in one of them. Apache Kafka (Scala), Hadoop (Java), Apache Spark (Scala), and Apache Cassandra (Java) are just a few examples. Applications written in Java/Scala may also run faster than those coded in Python, which is another factor worth keeping in mind.

Automation and Scripting

As a Data Engineer, you will typically automate a lot of tasks. You may need to clean up a database table every few days or run a backup procedure on your data sets. Such tasks can be easily automated with scripting languages, such as Bash. People typically get to know the basics of scripting when they start programming, but as a Data Engineer, you may need to put more focus on it.

Databases

Databases are an essential concept for any Data Engineer, as they are the most typical solution for data storage. You will probably need to learn both relational and non-relational database concepts. A good place to start is the SQL language, which is the de facto standard for querying databases.
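To give a feel for what querying a database with SQL looks like, here is a minimal sketch that uses Python’s built-in `sqlite3` module with a temporary in-memory database. The `rides` table, its columns, the sample rows, and the `popular_cities` helper are hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical sample data: (city, fare) pairs for a few rides.
SAMPLE_ROWS = [("Warsaw", 12.5), ("Krakow", 8.0), ("Warsaw", 7.25)]


def popular_cities(rows):
    """Load ride rows into SQLite and return the ride count per city."""
    conn = sqlite3.connect(":memory:")  # temporary database, nothing is written to disk
    try:
        # Define the table structure: column names, data types, constraints.
        conn.execute(
            """
            CREATE TABLE rides (
                id   INTEGER PRIMARY KEY,
                city TEXT NOT NULL,
                fare REAL NOT NULL
            )
            """
        )
        # Insert all rows; the ? placeholders keep values separate from the SQL text.
        conn.executemany("INSERT INTO rides (city, fare) VALUES (?, ?)", rows)
        # Query: count the rides in each city, busiest city first.
        cursor = conn.execute(
            """
            SELECT city, COUNT(*) AS ride_count
            FROM rides
            GROUP BY city
            ORDER BY ride_count DESC, city
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for city, ride_count in popular_cities(SAMPLE_ROWS):
        print(city, ride_count)
```

The `CREATE TABLE` and `SELECT ... GROUP BY` statements are standard SQL and look much the same in other relational databases. Only the connection code at the top of the function is specific to SQLite.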
SQL is traditionally associated with relational databases, but many non-relational tools also allow you to run SQL-like instructions.

Big Data Tools

Big data tools typically lie at the core of Data Engineering. While most Software Engineers focus on programming and some database basics, Data Engineers need a strong foundation in the big data toolset. This is a broad term that encompasses multiple data processing techniques.

With huge amounts of input data, you will typically need to use parallel processing. Apache Spark is probably the most popular parallel processing engine right now. It is designed to outperform older solutions such as Hadoop. Apart from parallel processing of data, you can also learn to process big data in streams. Apache Kafka is a popular choice for data streaming and a good place to start.

Cloud Computing

Traditionally, companies would set up their own physical servers in their offices to store and process the data they need. However, managing servers on your own is costly, and many companies have decided to cut these expenses. This is where cloud platforms come in. Companies such as Google, Amazon, and Microsoft offer their own servers where you can store data, perform computations, and manage your data processing tasks. When performing computations and data processing in the cloud, you typically only pay for the time and CPU power you actually use. Cloud solutions also offer additional services, such as auxiliary servers and automatic file backups. This is why these platforms are becoming more and more popular.

As a Data Engineer, there’s a good chance you’ll work with such solutions. The most popular choices are Google Cloud (Google), Azure (Microsoft), and AWS (Amazon). In fact, the range of available cloud products is now so wide that there are IT professionals who specialize just in that!

Learning SQL for Data Engineering

If you’re just beginning your IT career, you may be daunted by the number of technologies that we mentioned above.
There are different approaches to learning Data Engineering. At LearnSQL.com, we believe that getting to know relational databases and SQL is a good place to start. Relational databases are the most widely used type of database. SQL, in turn, is the language used to communicate with such databases.

In the IT world, you will find many people who only know how to write SQL SELECT statements, which are used to retrieve data from databases. However, as a Data Engineer, you will also need to learn how SQL can be used to manage the database structure. This will give you more insight into how database tables are created, what data types they offer, and how you should model your databases to make them quick and efficient.

At LearnSQL.com, we now offer a dedicated Data Engineering path, where you can complete five important courses:

The Basics of Creating Tables in SQL: This course is meant for beginners and covers the basic syntax of creating tables in relational databases.
Data Types: This course discusses SQL data types in detail. If you want to get an idea about what data types are, you can read our article about numerical data types.
Constraints: This course covers the various constraint types available in an SQL database.
Views: This course explains what SQL views are and how to create, modify, and remove them.
Indexes: In the final course, you’ll learn how to use indexes in databases to speed up data retrieval.

Don’t worry if you’re afraid of setting up a database management system on your computer. LearnSQL.com does all of that for you. All you need is a web browser with internet access. We prepare a database structure for you in the background so you can focus on learning SQL.

Another important advantage of learning with us is that you get all the core concepts related to managing databases in a single place. You don’t need to look around the internet for various materials focused on single topics.
Instead, you get a comprehensive learning experience with us. We start with easy concepts and guide you throughout the learning process. The new SQL learning path is tailored to the needs of aspiring Data Engineers, so you don’t need to spend time deciding what is important for this particular career path: we’ve done the job for you!

Summary

The position of Data Engineer is constantly evolving. Data Engineers focus on creating database structures and processing large amounts of data efficiently. It is a multidisciplinary role that requires some knowledge of programming, automation, databases, big data, and cloud computing.