
How would you measure code quality? Would persistence be a good measure?

"It's difficult to come up with a good metric" doesn't imply "we should use a known-bad metric".

I'm kind of baffled that "lines of code" seems to have come back; by the 1980s people were beginning to figure out that it didn't make any sense.


Bad code can persist because nobody wants to touch it.

Unfortunately I’m not sure there are good metrics.


That question has been baffling product managers, scrum masters, and C-suite assholes for decades, along with how to measure engineering productivity.

The folks at Stanford in this video have a somewhat similar dataset, and they account for "code churn" i.e. reworking AI output: https://www.youtube.com/watch?v=tbDDYKRFjhk -- I think they do so by tracking if the same lines of code are changed in subsequent commits. Maybe something to consider.
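I don't know exactly how the Stanford group computes it, but a rough back-of-the-envelope version of that idea can be scripted against git directly. This is only a sketch, not their methodology: it treats hunk line ranges as comparable across commits, which is an approximation (real churn tools track lines with something like git log -L or blame).

    # Rough sketch: approximate "churn" as line ranges touched by one commit
    # that a later commit in the list touches again. Line numbers shift between
    # commits, so overlapping hunk ranges is only an approximation.
    import re
    import subprocess
    from collections import defaultdict

    HUNK = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

    def changed_ranges(repo, commit):
        """Return {path: [(start, end), ...]} of new-side line ranges touched by a commit."""
        out = subprocess.run(
            ["git", "-C", repo, "show", "--unified=0", "--format=", commit],
            capture_output=True, text=True, check=True,
        ).stdout
        ranges, path = defaultdict(list), None
        for line in out.splitlines():
            if line.startswith("+++ b/"):
                path = line[6:]
            elif line.startswith("+++ "):
                path = None  # deleted file, no new side
            elif path and (m := HUNK.match(line)):
                start, count = int(m.group(3)), int(m.group(4) or "1")
                if count:
                    ranges[path].append((start, start + count - 1))
        return ranges

    def churn_ratio(repo, commits):
        """Fraction of touched ranges reworked by a later commit (commits oldest first)."""
        per_commit = [changed_ranges(repo, c) for c in commits]
        touched = reworked = 0
        for i, ranges in enumerate(per_commit):
            for path, spans in ranges.items():
                for (a, b) in spans:
                    touched += 1
                    if any(a <= d and c <= b
                           for later in per_commit[i + 1:]
                           for (c, d) in later.get(path, [])):
                        reworked += 1
        return reworked / touched if touched else 0.0

Feed it commits oldest-first (e.g. from git rev-list --reverse) and you get a crude "how much of what we wrote got rewritten" number.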

I don't know if code is literature, but I think measuring code quality is somewhat like measuring the quality of a novel.

The way DORA does. Error rate and mean time to recovery.
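For what it's worth, both of those reduce to simple arithmetic once you have incident and deploy records. A minimal sketch, where the field names (opened_at, resolved_at) and the deploy count are assumptions about whatever tracker you export them from:

    # Minimal sketch of the two DORA stability metrics from incident records.
    from datetime import timedelta

    def dora_stability(incidents, deploy_count):
        resolved = [i for i in incidents if i.get("resolved_at")]
        # Mean time to recovery: average of (resolved - opened) over resolved incidents.
        mttr = (sum(((i["resolved_at"] - i["opened_at"]) for i in resolved), timedelta())
                / len(resolved)) if resolved else timedelta(0)
        # "Error rate" here as incidents per deploy (roughly change failure rate).
        failure_rate = len(incidents) / deploy_count if deploy_count else 0.0
        return mttr, failure_rate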

This is a great suggestion. I'll note it down for next year's. Curious: do you think this would be a good proxy for code quality?

I would consider feature complete with robust testing to be a great proxy for code quality. Specifically, that if a chunk of code is feature complete and well tested and now changing slowly, it means -- as far as I can tell -- that the abstractions contained are at least ok at modeling the problem domain.

I would expect that code which continually changes, deprecates, and creates new features is still looking for a good problem-domain fit.


Most of our customers are enterprises, so I feel relatively comfortable assuming they have some decent testing and QA in place. Perhaps I am too optimistic?

That sounds like an opportunity for some inspection: coverage, linting (type checking??), and a by-hand spot check to assess the quality of testing. You might also inspect the QA process (ride-along with folks from QA).

It's tricky, but one can assume that code written once and not touched in a while is good code (it didn't cause any issues, performance is good enough, etc.).

I guess you can already derive this value if you sum the total lines changed by all PRs and divide by (SLOC end - SLOC start). Ideally it should be a value only slightly greater than 1.
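A quick worked example of that ratio, with made-up numbers:

    # Made-up numbers to illustrate the ratio described above.
    total_lines_changed_in_prs = 120_000     # additions + deletions summed over all PRs
    sloc_start, sloc_end = 400_000, 480_000  # repo size at the start / end of the period

    ratio = total_lines_changed_in_prs / (sloc_end - sloc_start)
    print(ratio)  # 1.5 -> roughly half a line of rework for every net new line kept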


It depends on how well you vetted your samples.

fyi: You headline with "cross-industry", lead with fancy engineering productivity graphics, then caption it with small print saying it's from your internal team data. Unless I'm completely missing something, it comes off as a little misleading and disingenuous. Maybe intro with what your company does and your data collection approach.


Apologies, that is poor wording on our part. It's internal data from engineers who use Greptile (tens of thousands of people across a variety of industries), as opposed to external, public data, which is where some of the charts come from.

This is per month; I see now that's not super clear on the chart!

We're careful not to draw any conclusions from LoC. The fact is that LoC is higher, which by itself is interesting. It could be a good or bad thing depending on code quality, which itself varies wildly person-to-person and agent-to-agent.

When the heading above it says "Developer output increased by x", I think you're very much drawing conclusions.

Can you expand on why it is interesting?

Because it's different. Change is important to track.

We weren’t able to agree on a good way to measure this. Curious - what’s your opinion on code churn as a metric? If code simply persists over some number of months, is that an indication that it’s good-quality code?

I've seen code persist a long time because it is unmaintainable gloop that takes forever to understand and nobody is brave enough to rebuild it.

So no, I don't think persistence-through-time is a good metric. Probably better to look at cyclomatic complexity, and maybe, for a given code path or module or class hierarchy, how many calls it makes within itself vs. to things outside the hierarchy - some measure of how many files you need to jump between to understand it.
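Both of those are scriptable. A rough sketch for a single Python module, using the radon package for cyclomatic complexity and a crude AST pass for the "calls within itself vs. outside" ratio; the locality heuristic here is my own approximation, not a standard metric:

    # Sketch: worst cyclomatic complexity in a file, plus the share of calls that
    # target functions/classes defined in the same file.
    import ast
    from radon.complexity import cc_visit  # pip install radon

    def module_metrics(path):
        source = open(path).read()
        worst_cc = max((block.complexity for block in cc_visit(source)), default=0)

        tree = ast.parse(source)
        local_names = {n.name for n in ast.walk(tree)
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}
        internal = external = 0
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name) and node.func.id in local_names:
                    internal += 1
                else:
                    external += 1
        locality = internal / (internal + external) if (internal + external) else 1.0
        return worst_cc, locality

Low locality isn't automatically bad (calling well-named library code is fine), but combined with high complexity it's a decent smell detector for the "jump between ten files to understand it" problem.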


I second the persistence point. Some of the most persistent code we own persists because it’s untested and poorly written, but managed to become critical infrastructure early on. Most new tests are best-effort black box tests and guesswork, since the creators left a long time ago.

Of course, feeding the code to an LLM makes it really go to town. And break every test in the process. Then you start babying it to do smaller and smaller changes, but at that point it’s faster to just do it manually.


You run a company that does AI code review, and you've never devised any metrics to assess the quality of code?

We have ways to approximate our impact on code quality, because we track:

- Change in number of revisions made between open and merge before vs. after greptile

- Percentage of greptile's PR comments that cause the developer to change the flagged lines

Assuming the author will only change their PR for the better, this tells us whether we're impacting quality.

We haven't yet found a way to measure absolute quality, beyond that.
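Purely as a hypothetical sketch (not a claim about how Greptile actually computes it), the second metric above is roughly this, if you have each comment's file path and line number plus the lines touched by subsequent revisions:

    # Hypothetical sketch: share of review comments whose flagged lines were later
    # changed. The input shapes are assumptions, not Greptile's real data model:
    # each comment has "path" and "line"; changed_lines maps path -> set of line
    # numbers touched in revisions pushed after the comment.
    def comment_actioned_rate(comments, changed_lines):
        if not comments:
            return 0.0
        actioned = sum(
            1 for c in comments
            if c["line"] in changed_lines.get(c["path"], set())
        )
        return actioned / len(comments)

    # Example:
    # comment_actioned_rate(
    #     [{"path": "src/app.ts", "line": 42}],
    #     {"src/app.ts": {41, 42, 43}},
    # )  # -> 1.0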


Might be harder to track but what about CFR or some other metric to measure how many bugs are getting through review before versus after the introduction of your product?

You might respond that ultimately, developers need to stay in charge of the review process, but tracking that kind of thing reflects how the product is actually getting used. If you can prove it helps to ship features faster as opposed to just allowing more LOC to get past review (these are not the same thing!) then your product has a much stronger demonstrable value.


I've seen code entropy suggested as a heuristic to measure.

We expressly did not conclude that more lines = better. You could easily argue more lines = worse. All we wanted to show is that there are more lines.

Language like "productivity gains", "output" and "force multiplier" isn't neutral like you're claiming here, and does imply that the line count metric indicates value being delivered for the business.

> Lines of code per developer grew from 4,450 to 7,839 as AI coding tools act as a force multiplier.

I mean, come on, now. "Force multiplier" is hardly ambiguous.

We have known that this is a useless way to measure productivity since before most people on this site were born.


We were trying not to insinuate that, because we don’t have a good way to measure quality, without which velocity is useless.

Hi, I'm Daksh, a co-founder of Greptile. We're an AI code review agent used by 2,000 companies, from startups like PostHog, Brex, and Partiful to F500s and F10s.

About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.

We decided to compile some of the most interesting findings into a report. This is the first time we've done this, so any feedback would be great, especially around what analytics we should include next time.


If AI tools are making teams 76% faster with 100% more bugs, one would presume you're not more productive; you're just punting more debt. I'm no expert on this stuff, but coupling it with some type of defect-density insight might be helpful. I'd also be interested to know what percentage of AI-assisted code is "rolled back" or "reverted" within 48 hours. Has there been any change in the number of review iterations over time?
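The 48-hour revert idea is fairly easy to approximate from git history alone, at least for explicit git revert commits (squash-and-redo fixes won't show up, and git by itself can't tell AI-assisted code from hand-written code). A rough sketch:

    # Rough sketch: count commits that were explicitly reverted within N hours,
    # by matching the "This reverts commit <sha>" line git revert writes.
    import re
    import subprocess

    def quick_revert_stats(repo, window_hours=48):
        log = subprocess.run(
            ["git", "-C", repo, "log", "--format=%H|%ct|%B%x00"],
            capture_output=True, text=True, check=True,
        ).stdout
        when, reverts = {}, []
        for entry in log.split("\x00"):
            entry = entry.lstrip("\n")
            if not entry:
                continue
            sha, ts, body = entry.split("|", 2)
            when[sha] = int(ts)
            m = re.search(r"This reverts commit ([0-9a-f]{40})", body)
            if m:
                reverts.append((sha, m.group(1)))
        quick = sum(
            1 for revert_sha, original in reverts
            if original in when
            and when[revert_sha] - when[original] <= window_hours * 3600
        )
        return quick, len(reverts), len(when)  # quick reverts, all reverts, commits scanned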

I’m interested in earnings correlating with feature releases. Maybe you’re pushing 100% more bugs, but if you can sell twice as many buggy features as your neighbor in the same amount of time, you could land more contracts.

It’s definitely a race-to-the-bottom scenario, but that was already the scenario we lived in before LLMs.


Right? I want to see the problem-ticket variance year over year, with something to qualify the data if release velocity is higher.

i wouldn't find that convincing.

plenty of tickets are never written because they don't seem worth tracking. an llm speeding up development can have the opposite effect - increasing the number of tickets, because more fixes look possible than before


Fair. Everything has nuance.

Hey! Thanks for publishing this.

Would be interested in seeing the breakdown of uplift vs. company size.

e.g. I work in a FAANG and have seen an uptick in the number of lines on PRs, partially due to AI coding tools and partially due to incentives for performance reviews.


This is a good one, wish we had included it. I'd run some analysis on this a while ago and it was pretty interesting.

An interesting subtrend is that Devin and other full async agents write the highest proportion of code at the largest companies. Ticket-to-PR hasn't worked nearly as well for startups as it has for the F500.


It’s hard to reach any conclusion from the quantitative code metrics in the first section because, as we all know, more code is not necessarily better. “Quantity” is not actually the same as “velocity”. And that gets to the most important question people have about AI assistance: does it help you maintain a codebase long term, or does it help you fly headlong into a ditch?

So, do you have any quality metrics to go with these?


We weren’t able to find a good quality measure. LLM-as-judge didn’t feel right. You’re correct that without that, the data is interesting but not particularly insightful.

Thanks for publishing this. People will complain about your metrics, but I would say it's just useful to have metrics of any kind at this point. People talk a lot about AI coding today without having any data, just thousands of anecdotes. This is like a glass of pure water in a desert.

I'm a bit of an AI coding skeptic btw, but I'm open to being convinced as the technology matures.

I actually think LOC is a useful metric. It may or may not be a positive thing to have more LOC, but it's data, and that's great.

I would be interested in seeing how AI has changed coding trends. Are some languages not being used as much because they work poorly with AI? How much is the average script length changing over time? Stuff like that. Also how often is code being deleted and rewritten - that might not be easy to figure out, but it would be interesting.


> About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.

Which stats in the report come from such analysis? I see that most metrics are based on either data from your internal teams or publicly available stats from npm and PyPI.

Regardless of the source, it's still an interesting report, thank you for this!


Thanks! The first 4 charts as well as Chart 2.3 are all from our data!

I actually ended up enjoying reading the cards after the charts more than I did reading the charts, but the charts were really interesting too.

Wish you'd show data from past years too! It's hard to know if these are seasonal trends or random variance without that.

Super interesting report though.


It's a shame that the AI branch of the software engineering industry is so determined to make us look like complete fools.

WHY ARE THEY STILL TALKING ABOUT ADDING LINES OF CODE IF THEY KNOW HOW SOFTWARE COMPLEXITY SCALES?

I could not put it more simply: you don't get the benefit of the doubt anymore. Too many asinine things have been done, like this line-of-code-counting BS, for me not to see it as attempted fraud.

Something we know for sure is that the most productive engineers are usually neutral or negative on lines of code. Bad ones who are costing your business money by cranking out debt: those are the ones who amp up your number of lines.


I cannot believe how often I have to call out ostensibly smart AI people for saying shit that is obviously not literally true.

It's like they all forgot how to think, or forgot that other people can spot exactly where they stopped thinking critically and started going with the hype. Many lines of code good! Few lines of code bad!


very cool report. been looking for some data on this (memory + AI SDKs) for a while :)

Greptile | Software Engineer (junior, senior, staff) | San Francisco ONSITE | https://greptile.com

Greptile is building AI agents that catch bugs in pull requests. Over 2,000 teams including Brex, Whoop, and Substack use Greptile to review nearly 1B lines of code every month.

We're a team of ~20 in San Francisco, working on things like better agent evals and sandbox execution environments.

We've raised $30M to date, including our recent Series A led by Benchmark.

Stack: TypeScript

Open roles: greptile.com/careers

Salary ranges: $140k-270k base (depending on seniority) + $40-100k/yr equity


Greptile | Software Engineer | ONSITE San Francisco (SF) | https://greptile.com

Greptile is working on AI agents that catch bugs and enforce standards in pull requests. Reviewing nearly 1B lines of code a month for 1000+ companies including Brex, Substack, Whoop, as well as multiple F100s.

<20 people, raised ~$30M from Benchmark, YC, Paul Graham, SV Angel and others.

To apply, email daksh [at] greptile.com with subject line "Engineering at Greptile". Include most recent role and company and links to your LinkedIn and GitHub.

