Open source developers would find this tool really useful if they could pull in their entire code contribution history from any project, not just LF projects. Can we discover and present contributions from any GitHub account associated with the user?
I agree, @Shubhra. Open source developers would then be able to use Individual Dashboard as their open source LinkedIn. Many open source communities would also love it if non-LF project events were represented on their profiles.
@ProdMgrs could this be considered in future updates?
@Henry_Quaye yes, this will be considered for future releases.
+1
Our team (Enarx project) contributes to upstream projects (Rust, WebAssembly, etc.), and being able to visualize those contributions would be important.
Besides visualizing an individual's contributions or a project's contributions, it would be splendid if we could somehow visualize aggregate contributions (from more than one project).
This is great news @Nadia_Shomali! Can’t wait to see this!
@NickVidal interesting recommendation, we do have the compare projects feature in Insights, where you can view metrics from multiple projects, but could you further explain your use case for this?
@NickVidal, I think what you have in mind is being able to answer the question “What projects is Enarx contributing to?” and to break down the number of contributions for each, with the goal of seeing which ones Enarx is most engaged in. Is that accurate?
Hi John, that’s correct. Thank you for the clarification.
I agree, this would be really valuable, especially as LFX has broader utility than just for LF projects.
I presume we could also use gharchive to get a broader history beyond what the GitHub API offers right now?
We looked at gharchive, since DevStats was using it. We could use it, but it had some limitations, if I am not wrong, @sgupta?
Yes, we had actually almost implemented support for GH Archive as a connector, but it failed our QA validation: on a regular basis we saw many events from the source GitHub repositories missing from the archives. We have detailed the issues in “Missing events in data files and missing data files” (Issue #245, igrigorik/gharchive.org on GitHub).
So we decided to move directly to the source i.e. Github APIs to pull the data for all the monitored repositories. It is slow I agree because of the rate limitations from Github but data availability and correctness is guaranteed.
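To put rough numbers on why the API route is slow, here is a back-of-the-envelope sketch. It assumes the documented REST limit of 5,000 requests per hour for an authenticated user token; the `requests_per_repo` figure and the `tokens` parameter are hypothetical inputs for illustration, not measurements from our pipeline.

```python
# Assumption: 5,000 requests/hour per authenticated user token (GitHub REST API).
# Other credential types can have different limits.
REQUESTS_PER_HOUR = 5000


def hours_to_sync(repo_count, requests_per_repo, tokens=1):
    """Estimate the hours needed to sync every repository once.

    requests_per_repo is a hypothetical average (commits, PRs, issues pages).
    tokens is the number of independent credentials sharing the work.
    """
    total_requests = repo_count * requests_per_repo
    return total_requests / (REQUESTS_PER_HOUR * tokens)


# Example: 1,000 repositories at an assumed 200 requests each.
print(hours_to_sync(1000, 200))  # 40.0
```

With those assumed inputs a single token needs more than a day for one full pass, which is the kind of constraint I mean.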
Hope that helps.
Cheers,
Sachin
“So we decided to move directly to the source i.e. Github APIs to pull the data for all the monitored repositories. It is slow I agree because of the rate limitations from Github but data availability and correctness is guaranteed.”
Do you feel this method is extendable to gathering metrics for non-LF projects, assuming the projects are already hosted on GitHub? (That feature would help fill my organization's near-term metric needs, or at least give us some examples of where we have more gaps.)
Yeah, this was my concern too about using the APIs directly - the rate limiting is pretty brutal.
It has been a long time since I wrote software that used gharchive, but I had similar issues back then with missing events. Hopefully this can be resolved soon.
Welcome to the community, @Matthew_Weber - I will defer to @sgupta on this, but I would assume so as you can use the GitHub API to query any projects there as far as I know.
Hi @Matthew_Weber,
Welcome to the community.
Scaling to hundreds or thousands of open source projects hosted on GitHub, while still syncing data at least once a day, will certainly be a challenge with the native GH API approach unless we get some concessions on rate limiting. At that scale GH Archive is a viable option, if we can live with a few inconsistencies in how the data is represented, given the limitation I described in my earlier comment. One way to minimize the impact on the data visualization is to present aggregated data rather than individual events such as commit hashes or submitted PRs. That way the data loss averages out and the dashboards still show close-to-correct contribution counts.
I think this is the approach most other platforms, including opensourceindex.io and CNCF DevStats, are using.
So, to answer Matthew’s question: the GH API approach would not scale for non-LF projects, and GH Archive is one way to handle the raw data collection for GitHub-hosted projects.
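The aggregation idea can be sketched roughly as follows. The event dictionaries with `project` and `type` keys are a hypothetical shape chosen for illustration, not the GH Archive schema, and the numbers are made up.

```python
from collections import Counter


def aggregate_counts(events):
    """Count events per (project, event type), dropping per-event identifiers."""
    return Counter((event["project"], event["type"]) for event in events)


def relative_undercount(complete, partial):
    """Fraction of events missing per key when the partial source has gaps."""
    return {
        key: (total - partial.get(key, 0)) / total
        for key, total in complete.items()
    }


# Illustrative data: 100 real commits, of which the archive captured 97.
source_events = [{"project": "enarx", "type": "commit"} for _ in range(100)]
archive_events = source_events[:97]

complete = aggregate_counts(source_events)
partial = aggregate_counts(archive_events)
print(relative_undercount(complete, partial))
```

A chart built from `partial` would show 97 commits instead of 100, a small error, whereas a per-commit listing would visibly be missing three specific hashes.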
An ideal solution, which would require participation from the projects, is to enable data collection via webhooks. But that will require significant work at the infrastructure level to process the events from thousands of repositories and to implement effective queuing strategies that minimize data loss.
But hey, we already have some experience with big data queries against GH Archive, so it is something we can consider: building an equivalent visualization layer to present the various metrics across commits, PRs, issues and releases.
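To give a feel for the queuing side, here is a minimal in-memory sketch, not a description of anything we have built. It assumes each delivery carries a unique ID (GitHub sends one in the `X-GitHub-Delivery` header); the `WebhookBuffer` class and its method names are hypothetical, and a real deployment would use a durable queue rather than process memory.

```python
import queue


class WebhookBuffer:
    """Accept webhook deliveries quickly and process them later (hypothetical)."""

    def __init__(self, max_size=10000):
        self._queue = queue.Queue(maxsize=max_size)
        self._seen = set()  # delivery IDs already accepted

    def receive(self, delivery_id, payload):
        """Return 'accepted', 'duplicate', or 'full' without blocking."""
        if delivery_id in self._seen:
            return "duplicate"  # a redelivery of an event we already hold
        try:
            self._queue.put_nowait((delivery_id, payload))
        except queue.Full:
            return "full"  # caller should record this so the event can be recovered
        self._seen.add(delivery_id)
        return "accepted"

    def drain(self, handler):
        """Pass every queued payload to handler; return how many were processed."""
        processed = 0
        while True:
            try:
                _delivery_id, payload = self._queue.get_nowait()
            except queue.Empty:
                return processed
            handler(payload)
            processed += 1
```

The design choice here is to keep the receiving path trivial (deduplicate, enqueue, return) so that slow processing never causes deliveries to be dropped at the door; the "full" case is exactly where data loss would start.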
@Matthew_Weber what are the most important metrics you would want to look at?