Okay, I think we'll go ahead and get started. Thank you everyone for joining. So, a little context on what we're doing and why we're doing it. Last year, we introduced our first version of the systematic review workflow on Elicit. Since then, we've been doing a lot to improve it, and last week we announced that it now supports PRISMA 2020 standards. So you can do very rigorous SLRs on Elicit and have the auditability and the traceability, along with the accuracy and performance, that you need to feel confident submitting your systematic review to a regulatory body or as a research article. We have statistics, we have numbers on how well it works, but I think the best way to show it is a live demo. Can we do a systematic review, which usually takes six to eight months and hundreds and hundreds of manual hours, in less than one hour? That's the purpose of what we're going to do today. Along the way, I'm going to show all the different features we've added to Elicit systematic review, and I want to leave plenty of time for any questions that you might have. When we run certain phases, for example extraction, which will take a little bit longer, I'll tackle the Q&A. So if you want to add questions as they come up, go into the Q&A tab and add them there. And if you want to be anonymous, totally fine; just make sure you hit that toggle. Without further ado, I'm going to go ahead and get started.
Okay. Cool. So this is the Elicit homepage. If you haven't used Elicit before, Elicit is trying to be the home where you can do evidence synthesis: where you can come and do fully rigorous systematic workflows as well as the quicker "help me find research in this area" or "is there any evidence that contradicts this?", and everything and beyond in the life sciences.
So we're going to focus on the systematic review workflow today. The way we're going to do it is I'm going to take the protocol that I have in this tab, which I've predefined, put it in, and run it in Elicit. To start, I take this research question that I have, go back to the home page, and just paste it in. So I have: among adults with type 2 diabetes initiating a GLP-1 receptor agonist in routine clinical practice, I'm really looking for persistence and adherence in real-world evidence over 12 to 24 months. Do they actually adhere to it in the real world? So I'm going to start with that, and that's going to load up the review. This is a new page we've added to Elicit systematic review, where essentially you define and set up your protocol before you go into the actual stages. There are a lot of stages and a lot of features; this is the one place to go in and say, this is what I want to do and here's all the context I need. So I put in the research question, as you can see. I'll come back to the additional context in a second, and I will also come back to dual review. We've added a feature where you can now use Elicit as decision support for two reviewers, or you can have Elicit be one reviewer and a human be the other. But what I really want to focus on here is the step configuration. A systematic review has many, many stages, and here we lay out all the stages that you can do in Elicit and let you configure them. First we're going to start off with gathering all of our sources. Then we're going to go into title and abstract screening. After that will be full-text screening. We'll extract the data from the papers that made it through all the screening. And finally we'll generate a narrative synthesis from the data. I want to call out a few toggles here. One is strict criteria.
This is also new, or newer, on Elicit. Essentially, for a rigorous systematic review, if a paper fails a single eligibility criterion then we have to fail it. I think that's the way systematic reviews are done, and so we've added that. When you toggle this on, any paper that fails even a single criterion will automatically be excluded. For more rapid reviews, scoping reviews, and so on, you can turn this off, and papers that fail a couple of criteria but are still important or useful will make it through. Same with full-text screening: you can turn on strict criteria. And then in terms of extraction, a feature we have on Elicit that we're very proud of is extracting from figures. So if there are tables, charts, graphs, Kaplan-Meier curves, and so on, Elicit can look at those charts and images and pull out data and estimate data for your extraction. I'm going to come back up here to the research question. So, additional context: this is actually where you can put in all of your protocol, and then Elicit will help figure out the rest. So, okay, you have your PICO criteria or your PICO definitions from that; what should the eligibility criteria be? There's a toggle here where Elicit will auto-suggest columns. If you use those in tandem, Elicit can take your protocol and come up with the criteria, and then you can modify those. So that's exactly what I'm going to do. I'm going to go back to here. I have my PICO, I have my eligibility criteria, and I'm just going to copy and paste those directly in. It's a bit of a text dump right now, but Elicit will understand this and will use it in the actual review. With that, let's go ahead and get started.
Okay. So now we're in the first stage, which is where we gather all the papers. Right now, this is doing a semantic, question-based search. It's taking my research question and looking inside our own internal corpus for all the papers that might be relevant.
Elicit's internal corpus has about 138 million papers, so I'm sure we'll find some great results. But for rigorous systematic reviews, keyword search really is the name of the game, so we have the ability to do that as well. What I'm going to do is add a new tab. There are a couple of options here: I can do a semantic search, I can do another keyword search, or I can upload papers. I'm going to add papers from keyword search. One thing to note is we've also added other databases. We have Elicit's internal research papers corpus, but we also have PubMed and ClinicalTrials.gov. For this systematic review, I'm going to focus on PubMed. I have the specific search strategy in the protocol doc, which I'm just pulling over. I can put that in and search papers. It'll take a few seconds to go get the papers from PubMed and pull them in here. Now, one thing to note is that you can do multiple searches. They'll all flow into this one tab where you have all your papers gathered together, and we can support up to 40,000 papers per systematic review. So that's where you would start, with them all here, whether that's from your semantic searches over Elicit or your keyword searches. Or, if you want to search Embase or another database that we don't have in product yet, you can always go there, do the searches, and then upload the papers: you can upload a .bib file, an RIS file, or, if you have the PDFs directly, you can upload those as well. So this is going to take a few more seconds to run. I'm going to go ahead and delete this "among adults" one. It returned about 10,000 results. I'm sure many are irrelevant, and I would like to stick with the PubMed one for now. Okay.
And one statistic to note on the search: when we do Elicit semantic search, we've tested it at about 5,000 and 10,000 papers. At 5,000, we get about 95% of the papers that you'd want in your systematic review. But our keyword search has loaded, and we can see all the papers over here: 984. And if you copy and paste this keyword query directly into PubMed, you'll get pretty much the exact same number of results. Okay, that's the search phase. We've gathered the papers that I'd like to use in the systematic review.
Now we're going to go on to the screening phase. In screening, of course, the objective is to take all of our papers and remove the ones that we know we definitely do not want in our data analysis and extraction. Here we're focusing on the title and abstract screening phase. One thing we really focused on with Elicit is making it as inclusive as possible, to get really high sensitivity. As I think is common knowledge, a false negative is probably the worst thing you can get in a systematic review. If you miss a paper that's really important, that could change your results drastically. False positives you still don't want, but they're a little bit easier to remove and separate out. So we really erred towards making sure we get all the right papers in. The way screening works in Elicit is we have, essentially, eligibility criteria questions. You can add your own, and you can see Elicit suggesting some based on the protocol I added earlier. The way it works is I'm going to ask it a question like, "Is this study focusing on adults?" We can look at a couple here. Appropriate population focus: is the study population not primarily composed of pediatric patients and some other groups that we don't want to see? Cohort design: does the study use... you know, these are compound questions that we can do in here. Does the study investigate an approved GLP-1 receptor agonist?
And the way it works is, you'll see it's already starting to populate. Each criterion is evaluated against the paper, and each criterion returns a yes, a no, or a maybe. If any of these are a no, for example for this paper, since it has a no on cohort design, we know it's just going to be removed entirely from the actual systematic review. So I'm going to go ahead and evaluate screening, and then I'll come back and explain more. Now it's taking those criteria and running them against all 984 abstracts. Back on this previous screen, what it did is select 100 at random from the 984, and the purpose is that it's like a pilot phase. Essentially, what this is useful for is: we have these criteria, so where are the maybes occurring? We can then go and see, okay, on the appropriate population focus, is our criterion clear enough for Elicit to understand? Are there any areas where I disagree with it? For example, this no: actually, I think it should be a yes based on my reading of the paper. Then I would go in and see, based on that, do I need to modify the criterion? You modify or update it, and you can do that until you feel comfortable and satisfied with how this is performing. Okay. So now Elicit is going against all the abstracts. It's about 12% complete; it'll take a couple of minutes. So in this phase we've done the pilot, and now it's running against all of them. That's how screening works in the title and abstract phase. We've done an evaluation of this as well, which we announced with our release last week, so let me switch over to show you what we found. Okay. We evaluated title and abstract screening in Elicit against 108 Cochrane reviews, which in total comprised 6,93 individual papers that needed to be screened in or out. And using our system we got to a sensitivity, essentially, of all the papers that should be screened in.
How many of them did we get? 96.9%. And a specificity, of all the papers that should be left out, how many do we leave out? 92.5%. We think these are pretty good numbers. We compared it to a dual-review human benchmark that we found from Gartlehner 2020 of about 97.5% sensitivity and 68.7% specificity for two humans doing systematic reviews. So we think we compare favorably on this dimension, but of course evaluations differ. All right. Any questions at this point while we wait for this to finish? Cool. I'll continue, but of course feel free to add questions in the Q&A.
So, I've talked a little bit about how screening works in the title and abstract phase, and I've shown some statistics showing we get pretty good results in terms of sensitivity and specificity. But of course, the main thing is we want to make it really easy for researchers and reviewers to verify every single decision recommendation that Elicit makes, and over time that will build trust. There are a couple of ways we do that. One is, for every single paper, if you click into it, we show every single criterion that it's evaluated against. With every criterion, we also show the rationale for why it either passed or didn't pass, and we have source quotes. We also have a reading mode, because this is more of the table view; if you really want to focus on every single data point, we have a reading mode here. This paper, for example: appropriate population focus. Okay, Elicit states the study focuses on adults with type 2 diabetes. How do I know that? I click on the source quote here, and it will immediately highlight within the abstract the specific quote that gave Elicit the confidence to either pass or fail this criterion. And you can do this with every single criterion in every single paper. Cohort design: also this quote, and there's another quote that made sure it had six months of utilization beforehand.
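For readers who want to reproduce the two screening metrics quoted above on their own exported decisions, here is a minimal sketch of how sensitivity and specificity are computed from paired include/exclude labels. The function name and the ten example labels are made up for illustration; this is the standard definition, not a description of how Elicit's evaluation was implemented.

```python
def sensitivity_specificity(should_include, screened_in):
    """Return (sensitivity, specificity) for paired boolean labels.

    should_include[i] is the reference decision for paper i,
    screened_in[i] is the screener's decision for the same paper.
    """
    pairs = list(zip(should_include, screened_in))
    tp = sum(1 for ref, got in pairs if ref and got)          # correctly included
    fn = sum(1 for ref, got in pairs if ref and not got)      # missed papers
    tn = sum(1 for ref, got in pairs if not ref and not got)  # correctly excluded
    fp = sum(1 for ref, got in pairs if not ref and got)      # wrongly included
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up labels for ten papers: four should be included, six should not.
reference = [True, True, True, True, False, False, False, False, False, False]
screener = [True, True, True, False, True, True, True, False, False, False]
sens, spec = sensitivity_specificity(reference, screener)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A missed paper (false negative) lowers sensitivity, while a wrongly included paper (false positive) lowers specificity, which is why a screener tuned to be inclusive trades some specificity for sensitivity.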
So you can check every single decision. Of course, sometimes you might disagree with Elicit's decision or want to override it, and that's also possible. You can very simply say, okay, Elicit said this should be an include; actually, I'm going to say this is an exclude. And I can select the exclusion reasons as well. So I would say, actually, I think cohort design was off here. In this case it's not, but if it was, I could choose that and then hit save. And you can do that with every single paper. I'm going to jump down to some of the excludes a little bit lower. Okay, here we go. These are papers that Elicit thinks should be excluded because they're failing at least one criterion. So here I'm going to go into reading mode. Elicit will tell you, okay, it's an exclude. It'll also tell you why it thinks it should be excluded: what's the primary exclusion reason? The way we do that is, essentially, we go in PICO order. If there are multiple failing criteria, we go in PICO order to say which is the most important exclusion reason. In this one it was sample size, and I can go check: okay, why is sample size an exclude? That's the wrong one, let's see... yep, this analyzed 203 patients, and in our criterion we requested that it be more than 500. So I can go and verify that that is correct, and I can do this for every single paper. And then one last thing to note is, of course, there are exports. If you want to take this out into your own systems to do analysis, you can download it as a CSV or an Excel file. Those CSVs will have every single paper, every single criterion, the judgments, the rationales, and the source quotes as well, in case you want to check those against other systems. All right. We've added a lot of these features more recently, primarily to make sure that we have PRISMA compliance.
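The strict-criteria rule and the PICO-ordered primary exclusion reason described above can be sketched in a few lines. This is only an illustration under stated assumptions: the criterion names, the `PICO_ORDER` list, and the rule that a paper with no failures but at least one "maybe" is labeled "maybe" are all hypothetical, and none of this is Elicit's actual implementation.

```python
# Hypothetical criterion names, listed in the order used to pick the
# primary exclusion reason (population first, then intervention, etc.).
PICO_ORDER = ["population", "intervention", "outcome", "study_design", "sample_size"]


def screen_strict(judgments, order=PICO_ORDER):
    """Apply a strict-criteria rule to per-criterion judgments.

    judgments maps criterion name -> "yes" | "no" | "maybe".
    Returns (decision, primary_exclusion_reason).
    """
    # Failing criteria, with the ordered ones first and any others after.
    failed = [c for c in order if judgments.get(c) == "no"]
    failed += [c for c, v in judgments.items() if v == "no" and c not in order]
    if failed:
        # Strict mode: a single "no" is enough to exclude the paper.
        return "exclude", failed[0]
    if any(v == "maybe" for v in judgments.values()):
        # Assumption: unresolved criteria leave the paper for human review.
        return "maybe", None
    return "include", None


paper = {
    "population": "yes",
    "intervention": "yes",
    "study_design": "no",
    "sample_size": "no",
}
print(screen_strict(paper))
```

With two failing criteria, the earlier one in the ordering is reported as the primary reason, which is the behavior a PRISMA flow diagram needs so that each excluded record is counted under exactly one reason.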
And when you do your screening, for example, Elicit will show you in the final report. Let me go back down to the excludes, past all the includes here. Okay, this one is going to be excluded due to sample size. Elicit will also list that in the PRISMA flow diagram that we produce at the end, which shows all the reasons why, say, 660 abstracts were excluded: it's going to show you that 20 of them were for sample size, 10 of them were because the study design was off, or something like that.
All right, I see there's a question. You can use the chat; you can also use the Q&A feature if you would like. How does this compare with Nested Knowledge? It's a great question. I think Nested Knowledge came from the more manual side of systematic review, so they've built out meta-analysis, critical appraisal, and so on and so forth, whereas we are coming more from: can we build in AI from the very beginning? Within the world of AI systematic review, I think we have a pretty defined philosophy. I have some blog posts I can link you to. But essentially, when you use something like ChatGPT or a generic AI tool, it tries to cram everything into one answer, whereas we have some philosophies around factored cognition. Essentially, we break down every large task into smaller tasks; at the smaller-task level, Elicit, or AI models, can be more accurate and more traceable. With those things combined, we can bring in AI and have lots of oversight and accuracy. So, different approaches to the world of systematic review, but I think we have a similar mission of helping people do much better research, much more research, and helping systematic reviews keep up with the amount of evidence piling up out there. Okay, all the abstracts were evaluated, 984 of them.
I can go in and review every single one if I would like to, but I'm going to move on to the full-text screening stage. The full-text screening stage is very similar to the title and abstract stage in that we have the criteria that we selected before. I might add one here, because there are some details that are only available in the full text and not in the title and abstract. And the second thing is, we need to have the full text. There's one criterion I want to add; I will add that here, and then we can kick off the full-text screening. Okay. So, very similar to the title and abstract stage, but we're going to use only full text. There's a big question here that I'm sure you're thinking of: how do we get the full text? Elicit actually has a couple of ways to provide the full text to this stage. One is that Elicit will go out and try to find all the open-access papers that it can. For example, we already have about 24 that we were able to find, either their full text or their conference abstracts. So we find them automatically for you. We also have a Chrome extension: if you use that and you have institutional access (I do not), you can use those credentials to get the papers automatically from the journals you have subscriptions to. And then finally, we show all the abstracts that we couldn't get full text for ourselves. We'll show you the exact link; if you click on it, it'll take you to the exact paper, and then you can get the PDF in some way and upload it to Elicit here. Once it's uploaded, that text will be processed along with the rest of the full texts. So we'll let that continue, and then let me share some stats, actually.
So I'm going to go back to this. Similar to title and abstract, you can click in and see the reading mode, and I think it's actually even more useful here, where there are many pages per paper. So, okay, focus on adults with type 2 diabetes: where is that coming from? In this case it might be the abstract. But if there's something, let's see here... yes, it will jump to the place in the paper where the information was pulled from. And as you can see here, in screening we're actually looking at the tables themselves. We can see, for example, that this has the withdrawal percentage at 12, 18, 24, and 39 months, and so we can use that information. Even if it's in a table or a figure, we will show you exactly where it came from in the paper. Okay, before we go on, I'm just going to share one stat here. We also evaluated full-text screening on Elicit. We took 74 Cochrane reviews and 377 full texts, and then we looked at sensitivity and specificity again. Here we saw about 99.5% sensitivity and 70.1% specificity. We also looked at the per-criterion level, which is to say, for every single criterion, how many of them do we get correct? And we saw about 94.8% accuracy on a per-criterion basis. Where we got it wrong, we over-indexed towards maybes and includes instead of excludes, again for the same reason: we don't want to miss any papers.
All right, so screening is done. Out of the 984 sources that we originally saw, we have about 40 sources included here. If we wanted, we could also get the full text for the 50 that Elicit couldn't find for us and go from there. So now we're going to go into extraction. Extraction is essentially: for all the sources that we've included, let's extract the data points that we actually need to be able to do our analysis, our synthesis, and so on.
Elicit is going to suggest some, but I wanted to show what it's like to create some custom columns, so I have some that I'm going to pull from the protocol. Give me a second to pull those in. And you can see that we're extracting from figures as well. There's also the ability to set an answer structure. Generally I default to "any", which allows Elicit to look across all the data and pull the answers into the right structure and format based on your prompt. But if you know that you only want a couple of choices to choose from, you can always choose that. For example, it's going to be an observational study versus an RCT or something: you can define those. And you can do yes/no/maybe in case you want answers in that structure. But I'm going to default to "any" right now. So that's one; let me add the others. And again, if you want to see where I'm getting these from, they're in the protocol Google Doc that I shared. Okay, that is going to run. Let me let it start running, because it's going to take some time. This is probably the lengthiest part of the Elicit systematic review, because for every single extraction field you defined, it's going to go against every paper, pull out all the data, add all the quotes, and make sure that nothing is unsupported. In the meantime: the interface looks similar to the screening phases, and that's on purpose. With AI, with Elicit, prompts do matter. So the pilot phase here is really to make sure that, when you put in your criteria, the format of the information that Elicit is pulling out is to your liking. If you don't like something, again, you can go in and specify a format. I did not specify a particular format on these ones, but you could say, for example, use parentheses for standard deviations and brackets for means, or vice versa.
Add those in and Elicit will follow that. You can iterate and make sure that the extraction is to your liking before you run it against all of them. And so we've selected 10 here. All right. Elicit has also suggested some columns based on the protocol I put in earlier, but I've kept these off for now because I'd like to go with my own extractions. Okay. So, this is going to take some time. As before, there's a download so you can export to CSV and Excel, which shows exactly what the papers are, what columns we're pulling, and what source quotes from each paper support each value. Let me see if I can show a couple here. Okay, so this is an example of an extraction: patient population. I gave it a long list of things to pull out. It's pulling them out from the paper, and you can see at all times the quotes that it's pulling from, and you can go through every single quote. It pulls from tables, charts, all kinds. And when something is not mentioned, I put in the prompt to say "not mentioned", so it will specifically say not mentioned. Oh yes, so this is a more complex diagram: key eligibility criteria. Elicit is essentially looking at this figure and saying, okay, what do I understand about it, and then using that to inform the extraction and the analysis that it's doing. Okay, so we're going to let this run for a second. I'm going to take a pause.
Okay, cool. There are a couple of questions; let me answer them. Can we add or create our own agents in Elicit instead of just using the preset research agent? So, that is another workflow that exists in Elicit. Let me see if I can pull that up. I will stop sharing and share this tab instead. Okay. So on the homepage, if you noticed, we had selected systematic review, but the base is to use a research agent. At the current moment, no: the main agent is the research agent that Elicit has defined.
We put a lot of effort into the prompts, the tools, and so on that it has, to make sure that everything is verifiable, everything is backed up, and it's never finding evidence where there is none or overextending claims. So, at the current moment, no, but I believe we made it flexible enough that you could go in and do pretty much any task that you want to do with it. If there's something in particular that you're curious about, drop it as a question, and I can answer that in terms of what kind of agents you're looking for.
Okay. Does it make sense to check all decisions in title and abstract screening by human overview, or should a specific percentage be checked? I think this is something the industry has not yet settled on. I know a lot of tools do have priority screening, or check at, you know, 20%, and if it's all good after that point then you can just let the system take over. From the research I've seen online, even these kinds of systems leave some valuable papers on the table. So it feels like, at the current moment, we cannot yet trust a system to do all of it by itself. So I think human overview is going to be required, but that might change in the next year or two as we get more validation. For example, if Elicit is performing at 97% sensitivity and 92% specificity, comparing favorably with dual human review, I think that's some evidence that we should actually move towards checking a percentage, taking a random sample, and so on. So for now I think it still makes sense to check them, and we've added features like the reading mode to allow you to check every single decision very quickly and easily. But over time we hope to cut that down to where you check a percentage, where you can do calibration within the review.
Is it possible to show off KM curve digitization? Yes, it should be.
So the way it works in Elicit is, we don't replicate the diagram internally. What we do instead is look at the KM curve and try to extract the data points. If I can find one in here, I'll try to show it to you. It should be somewhere, probably in primary outcomes. Okay, let's see, let me find a case where that does work. Okay, so for example, persistence at 12 months here. What Elicit is doing is, it's got this figure, and based on this figure it's estimating (I'll have to find the exact number from the KM curve) 48 to 50% at 12 months. Let's see. I think this might be a case where it misestimated, and it might be closer to 60%, so that's about 10 percentage points off. This is probably on the larger end of errors, but it can digitize from the KM curve. Essentially, what it's going to do is estimate based on this, sort of eyeballing it: you go to the 12-month mark, then go up, and look at what the chart is saying there. Did I misunderstand that? Yeah. I think it's probably closer to 60%, whereas Elicit was saying 48 to 50. In other cases that I've seen, it's been a little bit more accurate; on average, the estimate is within about 5 percentage points of the actual value. Okay, and then I wanted to show one more thing while we wait for extraction to load, which is... oh, sorry, I realize I was on the wrong tab. This is the KM curve digitization. So here Elicit is looking at the curve. It's looking up at the 12-month mark, and the chart is saying this is right around the 0.6 probability of persistence, whereas I think Elicit has misestimated and gone to 58, or like 48 to 58 percent. Elicit usually is within about 5 percentage points when I've looked at other examples.
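The "go to the 12-month mark and read up" step can be made concrete once a curve has been digitized into points. The sketch below is an assumption-laden illustration: the point list is invented (it is not taken from the figure in the demo), and the lookup treats the Kaplan-Meier curve as the usual right-continuous step function. It does not describe how Elicit reads figures.

```python
from bisect import bisect_right

# Invented digitized points: (months since initiation, proportion persistent).
# A Kaplan-Meier estimate stays flat between event times and drops at each one.
km_points = [(0, 1.00), (3, 0.85), (6, 0.72), (9, 0.65), (12, 0.60), (18, 0.52)]


def km_value_at(points, t):
    """Read a step-function curve at time t (points sorted by time)."""
    times = [time for time, _ in points]
    i = bisect_right(times, t) - 1  # index of the last step at or before t
    if i < 0:
        return 1.0  # before the first recorded point, nobody has dropped out
    return points[i][1]


def error_in_points(estimate, actual):
    """Absolute difference between two proportions, in percentage points."""
    return round(abs(estimate - actual) * 100, 1)


read_at_12 = km_value_at(km_points, 12)
print(read_at_12, error_in_points(0.50, read_at_12))
```

Comparing a tool's estimate with a manual reading in percentage points, as the last function does, gives a simple way to spot-check figure-derived values before they go into a synthesis.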
Okay, I want to show one more thing while we wait for extraction to load, which is dual review. It's a new feature that we've added. I didn't use it for this review because it adds another person, which takes a little more time, but we do have it as a feature, and there are two ways it can work. Elicit can be decision support: two individual humans review Elicit's recommendations to make sure they're not missing anything, then they make their decisions and reconcile. Or Elicit can act as a second screener, in which case you run Elicit like I'm doing here, download the Excel file, compare it to what a human has done manually in their Excel file, and reconcile. Let me show you what Elicit as decision support for two humans looks like. I believe that should be this tab. Okay. So this is a separate review that I set up beforehand. It has far fewer papers; it's only got 10 papers in here. What I did is assign dual review. In that first protocol step, you can say "add a second reviewer". I added my other email so that I could do both sides and show what it looks like. Essentially, what you'll see here is that title and abstract screening is finished, and then I've started dual review. So now what happens is, Elicit has made its screening recommendation, and it's up to me to say: do I agree with this, or do I actually have a different decision? So I'm going to check every single decision. Let's just say I don't agree with this one, for whatever reason. I think it should be a maybe, because I've dug through it and found that a criterion like human adult participants is a strong enough criterion, and it's getting a maybe on that, so I think it's a maybe. I can save that, and I can also add a note for later use in reconciliation. Okay, that one's been marked; my decision is maybe. I'm going to accept this one. I'm going to accept this one.
I'll accept this one. Let's say I think this one should actually be an exclude. I'll save that and then go on. So now what's going to happen is another human is going to do the same thing across all of Elicit's papers, independently. They're not going to see my answers, so they're blinded. They will see Elicit's screening recommendations, though. Once they're done, which I've already done beforehand, we can reconcile. What that means is I will see: here's what Elicit recommended, here's what I, reviewer one, decided, and here's what reviewer two, my peer, decided. So we can see where we disagree, and really the disagreements that matter are between my decision and my peer's decision. So here I'm going to see my decision and the note I added earlier, and reviewer two's decision and the note they added earlier. The intent here is to support however review teams want to work. Maybe they do it over email. Maybe they get into a room and adjudicate. Maybe they bring in a third person to make those decisions. However that's done, reviewer one, who is the creator of the systematic review, can just put in the final decision that was made. So here, let's say we're just going to include it, and here we're going to exclude it. Okay, that will finalize that. At this point, I've locked in my entire title and abstract screening. Elicit has made its recommendations, human one has made theirs, human two has made theirs, and we've reconciled. So now we have our final title and abstract screening. We've also added some statistics, which I think are really helpful for calibrating how well Elicit is doing.
So what we've added here is, first, how reviewer one calibrated with Elicit. We had an 87.5% agreement rate across the papers, and when we excluded a paper, we had the same exclusion reason every time. We also have the ability to compute Cohen's kappa, but it's pretty unstable below about 50 to 100 papers, so it becomes available after you've added about 50 papers. The second thing is how the reviewers agree with each other: reviewer one versus reviewer two, their agreement rate, their exclusion-reason agreement, and the same thing with kappa. When you do a dual review, we also have an audit trail that I think is super helpful, so let me show you this. That was review stats. We also have review history. This shows you every single decision that was made: what Elicit recommended, what human one decided, and what human two decided. So, for example, this shows every single decision of mine and where we disagreed. Same with reviewer two: it shows every decision, where we disagreed, and the notes. It shows you exactly when each phase was reached: when you locked your protocol in, when you started screening, when you went to conflict resolution, which conflicts were resolved and how, and when it was done. You can also export it as a CSV, so if you want to include it in your PRISMA reporting documents, you can do that as well. So that is dual review. You can do it that way, with two humans on Elicit, or you can do it where Elicit is one screener and a human is the other. Okay. Let's see. A couple more things to note. Extraction is about 70% done. Again, this is going to be the longest part of the systematic review, because it's going through all the criteria and pulling the data out in detail, finding the quotes, analyzing the figures. We've talked a little bit about how auditability and traceability are super important at every single step of the way.
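As an illustration of the review stats mentioned here, the agreement rate and Cohen's kappa between two sets of screening decisions can be sketched as below. This is a minimal sketch, not Elicit's implementation: the function names and the eight example decisions are assumptions made up for illustration, chosen only so that the agreement rate works out to 87.5%.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of papers on which two raters made the same decision."""
    if len(a) != len(b) or not a:
        raise ValueError("decision lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own observed rates.
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical decisions for eight papers (illustrative data only).
elicit = ["include", "include", "exclude", "exclude",
          "exclude", "maybe", "include", "exclude"]
reviewer_one = ["include", "include", "exclude", "exclude",
                "exclude", "exclude", "include", "exclude"]

print(f"agreement: {percent_agreement(elicit, reviewer_one):.1%}")  # 87.5%
print(f"kappa: {cohens_kappa(elicit, reviewer_one):.3f}")           # 0.771
```

With only eight papers, changing a single decision moves kappa a long way, which is consistent with the point above that kappa is unstable on small sets.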
We've shown that in screening, where you can check every decision, see the rationale, and override it. The same is true in extraction. For each paper and each extraction field, you can see where in the paper it came from and the rationale Elicit had for extracting that data. In terms of editing, you can always go back to the extraction results and change them, or if you want to export, you can download the CSV and edit it there. We also have on our upcoming roadmap the ability to edit inside this view. All right, let me talk about the stats for this a little bit. I shared stats on title/abstract screening and full-text screening. For extraction, what we did is take about 28 Cochrane reviews and about 111 full-text papers that came with them, and we focused on extracting the methods, participants, and interventions the same way they did in the Cochrane review. We got to about 95.6% accuracy in that evaluation. There are some caveats that I'd recommend you check out in the blog post. In a Cochrane review, it's sometimes underspecified what you want to extract, so we formed a synthetic question from the extractions. It's a bit involved, but you can go check it out in the blog post. I would still stand by the 95.6% number as being accurate for the methods, participants, and interventions. That was our internal evaluation. We've also had external evaluations done independently. For example, Lee et al. from earlier this year evaluated an earlier version of systematic review and found that we had about a 1% hallucination rate, which is actually the same as the rate they measured for humans in their study, and the overall error rate was actually lower than for humans.
Since then, we've increased our accuracy and improved our system and its performance, which is why we're seeing 95% accuracy rates, which is better than before. But we've had both internal and external evaluations. Okay, let me answer some more questions. There are a couple of questions. "When you want to add a PDF to your search that was not downloaded automatically, what is the best way to feed it back into your current search? I had problems with this in the past. I had to start everything from scratch." Let me see. Okay. So, in terms of adding papers or PDFs to the systematic review, there are a couple of options. One is at the full-text stage, which is really when it starts to become important: you can always upload the PDF there to the specific abstract that is missing a PDF, and that will persist through the rest of the systematic review. Or, if you know you need some PDFs added for sure, you can upload them directly in the search stage, or into the library and then pull them from the library into the systematic review. Okay. And then there are three questions here. "Are there any copyright issues if I upload a PDF that I do not have access to?" That is something to check with the journal you're pulling from. There might be, and we're working on ways to let you easily check whether you have access, but it's something you'll have to check for the specific paper. "Is a risk of bias assessment already possible, or is one being planned?" You can always add risk of bias columns in the extraction stage, but we are specifically planning on adding critical appraisal to Elicit. That will be a separate stage where you can assess the risk of bias for each paper and then do analysis based on that. "How realistic is it that Elicit will one day be able to conduct meta-analysis?" is the last question here. Very realistic. We have it on our roadmap, coming up soon.
So the systematic review is going to have critical appraisal. Once we have critical appraisal, we're going to add the ability to do meta-analysis on top of it, so you can do quantitative synthesis as well as narrative synthesis. Let me jump over to Q&A for a second. Okay, that one was already addressed. "I know you mentioned that Elicit is different from ENK and that it's not AI native, but what about AutoSR?" To be honest, I have not used AutoSR. I know they've put out some great validation numbers, which are great to see. But I have not used it, so I cannot speak to their traceability, auditability, and reproducibility. I just haven't seen the system. But I will emphasize that what we've been doing from the start is focusing on making every decision traceable, auditable, and reportable, so that when you go to submit this as an SLR and you have to do the PRISMA documentation, you can report every decision and say exactly where it came from, which system was used, and when it was run. Okay, let's see. Extraction should be almost done. And cool, extraction is complete, and now we go on to generating the report. What we'll see as it runs here is sort of the PRISMA funnel. We'll have an explicit PRISMA diagram that shows up at the end, but we looked at 984 sources. We screened in 230 of them at the title/abstract stage. At full-text screening, we included 40 sources from that pool of 230. And then from the 40 sources, we extracted about 200 data points, which are going to go into the final report. I'm going to let that generate for a couple of minutes. In the meantime, I was going to show a couple of cool things, but actually, let me show a finished report while this one generates.
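The funnel numbers quoted here (984 sources, 230 after title/abstract screening, 40 after full-text screening) imply the per-stage exclusion counts that a PRISMA flow diagram reports. A small sketch of that arithmetic follows; the function name and the stage labels are assumptions for illustration, not Elicit's output format.

```python
def prisma_funnel(stages):
    """Given ordered (stage_name, records_remaining) pairs, compute how many
    records entered, passed, and were excluded at each screening stage."""
    rows = []
    for (_, entered), (name, included) in zip(stages, stages[1:]):
        if included > entered:
            raise ValueError(f"{name}: cannot include more records than entered")
        rows.append({
            "stage": name,
            "entered": entered,
            "included": included,
            "excluded": entered - included,
            "pass_rate": included / entered if entered else 0.0,
        })
    return rows


# Counts taken from the walkthrough; stage labels are illustrative.
stages = [
    ("identified", 984),
    ("title/abstract screening", 230),
    ("full-text screening", 40),
]

for row in prisma_funnel(stages):
    print(f"{row['stage']}: {row['entered']} in, "
          f"{row['excluded']} excluded, {row['included']} kept "
          f"({row['pass_rate']:.1%})")
```

So 754 records were excluded at title/abstract screening and 190 at full-text screening, which is the kind of breakdown the PRISMA diagram then splits further by exclusion reason.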
So I ran essentially the same protocol yesterday, just to check it out, and then generated a report from it this morning. What this does is take all the data points, look at the research question you had in the beginning, the PICO criteria, and so on, and then try to come up with essentially a review, a very rigorous analysis of that. So we'll have our summary. We'll have our abstract, which summarizes across all the evidence. And then I want to specifically call out the methods section, because this is going to be very important for showing that the review is rigorous and done in a PRISMA-compliant way. We have our PRISMA diagram, which shows you all the steps that we did. Let me zoom in a little bit. It shows you the papers that came in, the numbers, how many were screened out at each stage, and the specific exclusion reasons for the papers at both title/abstract and full text. It shows you the full search string and the number of results, and it shows you the exact screening questions that we used as well as the number of papers excluded. And in terms of extraction, it shows the exact extraction criteria or definitions we had as well. Cool. The rest of the report goes into the things you'd expect in a review of this kind. What are the characteristics of the included studies? It shows you all the data for that. It also shows you the claims and why it makes them, and again it goes into the paper and shows you the quotes from the underlying paper. It does this pretty consistently throughout. Apologies, let me rewind a little bit. So this is what a finished report looks like. I ran a very similar one earlier, and it looks at the research question.
It looks at all the data points that we pulled out and comes up with a synthesis across all the data, while also factoring in that some papers might disagree, so it tempers the evidence. It has an abstract and a summary in the way that you'd expect. It has a methods section, which shows you all the steps we did in the actual process. For example, it has a PRISMA diagram, and if we zoom in a little bit, it shows you the number of papers we pulled in, the abstract screening criteria we used, and how many papers were screened out for each specific criterion. Then it goes into full-text screening, the number of papers screened out at full text based on those criteria, and the number of papers we finally included. It also includes the full query that we used and the database we searched against, and then it lists in detail all the screening questions and the extraction definitions we wanted as well. Past that, it has a characteristics of included studies table. So for all 34 papers that we included, what are the characteristics that we wanted to look at, and what did they show? For each of them, it shows the full quote as well, where in the paper that claim comes from. Then it continues on, summarizing and synthesizing all the evidence to show what the persistence and adherence rates are at 12 months. It again has tables showing what each paper reports, and then it summarizes or synthesizes across those numbers, adherence and so on. Often it will do subgroup analyses, by dosing schedule and so on, to get all the subgroup analyses you would like in your review. I will caveat that, as it stands, the report that Elicit generates is more of a first pass. There's a lot here that you'd want to add. We want to be able to do meta-analysis.
You'd want to include more papers and do more rigorous checks of all the claims. But as a first pass, it gives you the lay of the land. All right, let's see how the report is going. The report is still generating. It'll take a couple more minutes, so let me answer some more questions. "I know this presentation is focused on a systematic review of published studies, but can you provide any information or suggestions for using Elicit for something like a landscape scan that also looks at gray literature and websites?" Yes. There are two answers to that. One is that, as part of our corpus, we do have some gray literature, conference abstracts and so on. So when you do a systematic review with the Elicit corpus and you do a semantic search or something similar, we will pull those in. But if you want to get even more, including websites and so on, I would start with the research agent on Elicit, which can scan across the web, find all these data sources, and compile them. Something we have planned and will be implementing soon is combining those two, so you do your web searches, pull those in, and do a systematic review over the results you found from the web search as well. Okay. "Are we going to receive the recording after the meeting?" Yes, absolutely. We will be sending out the full recording. And then: "Cochrane is evaluating tools based on their alignment with the Responsible AI in Evidence Synthesis (RAISE) recommendations. Have you submitted your tool for evaluation?" Yeah, we are talking with Cochrane about joining their study and what the steps would be. Our internal evaluations try to align with the RAISE guidelines in terms of showing independently validated performance for each stage of the pipeline, which is a big thing the RAISE guidelines look for. Another thing is that we've also added a lot of human oversight.
That was something we've wanted from the beginning, and we've added more and more stages of it. For example, dual review, and being able to see the review stats within dual review, so you can see where Elicit agrees with you and calibrate as you go, to make sure that, beyond independent or external validation, Elicit is also doing well in your specific review. And the final part is all the reporting that we have at the end, which you can use to make sure that when you use Elicit, you can report it accurately in your submissions. Next question: "The Lee paper says it's about a 20% error rate in both Elicit and humans, but if I remember correctly you mentioned 1%." Yeah. So the error rate and the hallucination rate are different. I believe Lee defined the hallucination rate separately, as more like confabulating evidence that's not there. There, I think it was 1%, which is similar to humans. The other thing is that the 20% error rate was on a previous version of Elicit. The paper came out in February, but I think it was using a version that's probably a year old. The newer version, I think, is showing closer to a 4 to 5% error rate. All right. "Can literature searches conducted in Elicit be considered reproducible if they're not run in PubMed? With PubMed, replication is straightforward because the exact search string can be copied and rerun. How can someone replicate a search performed on the broader semantic-search corpus, especially if the underlying query logic isn't fully transparent?" Yeah, that's a great question. To address exactly that, when I ran this PubMed query, I ran it specifically against the PubMed corpus. So if you take the results... let me pull up the exact query we ran. It's in exact PubMed syntax. Let me see if I can pull it up. I'll also share that we finished our review. We have about 8 minutes left before the end of the hour.
So we finished the end-to-end systematic review within an hour. We did that successfully. The report is going to be very similar to the previous one that I shared, because it's the same question with very similar sources and, of course, the same methods. But what I'm going to show is what happens if you take this exact query and run it against PubMed. Let's go to PubMed here. I'm going to run it. We should get about the same number: 984 versus, I think, 985. I think there's one paper that PubMed may have added since we started, or something like that. But essentially, the exact same papers that flow through from PubMed will flow through Elicit. So when you report the PubMed query or the keyword query, it's going to be reproducible every single time, especially if you add the date filter. Does that answer your question? If not, let me know if there's a follow-up you want addressed. Okay. So we've done our review. We started from the protocol and ended with the report. You can always save. You can download PDFs. You can download the BibTeX files associated with this, and the RIS files. As we mentioned throughout, you can also download the CSV and Excel files at every stage. The extraction stage is especially valuable, I think: download the CSV to see all the extracted data and verify it. In the future we'll be able to do critical appraisal and meta-analysis on top of this, so this extraction data will be used for that meta-analysis. And then finally, we generate the report, including citations. When you download the PDF of the report, the citations are converted to APA format and numbered. There are a couple of things I want to add now that we've hit the end. One question is, how did I come up with this protocol? Usually this would take weeks of checking against the literature. I am not a librarian. I don't necessarily have the expertise to define a protocol and do it super well.
So I actually used Elicit to help me define the protocol for this run, and I'll show you how I did that. I'm going to take the research question, which is simply: among adults with type 2 diabetes using GLP-1 receptor agonists, what is the persistence and adherence? What I did is go to the research agent and say, can you help me define a protocol for a systematic review on this? I often prompt it to run some searches too. So I put that in, and the research agent has access to the full Elicit corpus as well as keyword search, ClinicalTrials.gov, and PubMed. So it can run queries against the corpus and essentially do a mini scoping review, where it runs searches, finds papers, and tries to figure out the anchor papers we'd want in any systematic review on this topic, the really important papers. Then it helps me formulate queries and makes sure that I get those papers in there. Then it looks at papers that should be included versus papers that shouldn't, and tries to figure out the screening criteria that really matter, to make sure that the papers that make it through are high quality or high signal. So here it's finding a bunch of papers. It will continue to do that, and then it will come back with a systematic review protocol in just a second. It's reading through all the sources, and similar to the systematic review workflow, the research agent also has access to the papers. It does summaries and quotations as well, to make sure that every analysis it does comes with quotes and the rationale for why it's included. Here we go. It's almost done, like a mini version of the systematic review in about a minute. Of course, I would never use this for a rigorous submission, but it does help inform me on what I need to put in my systematic review.
That might be a bug with the hes, but nonetheless, here we go. Draft systematic review protocol. The background and rationale. What should the interventions, comparators, and outcomes be? Study design: what should be included versus excluded? What I would probably do is iterate with this a little more and look at more searches. But in about 10 minutes, which is what it took me when I first did it, I got to a pretty sizable first version of the protocol. I have one more thing to share as well. We also released an API. I think it's maybe the world's first systematic review API. The reason is this: you can do a systematic review on Elicit using Elicit's corpus, but if you're, you know, a team with hundreds and hundreds of research questions to investigate, how do you do this at scale? How do you make sure you don't drop the ball on any single one of them? And how do you bring your context into the systematic review? Let me show you how to do that in the two minutes we have left. Okay. I'm using everyone's favorite, Claude Code, here. Basically, I've set it up with my API key already, and I have a bunch of protocols in this folder that I want to run, and you can fire them off like that. You can also do it more programmatically, where you write a Python script and say, do a systematic review for each of these protocols. But I've defined them already in a folder. What it's going to do is look through that folder, get my API key, and fire off a bunch of requests to the Elicit API to do the systematic reviews. The neat thing is that once it fires one off, it also gives me a URL where I can see the systematic review running in real time. So let's see if it does that. It'll take a second. These are the docs, which you can also access at docs.elicit.com, and it should start firing them off in about a second. All right, we're just about at time. I think it should be almost done firing them off.
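The "Python script over a folder of protocols" idea mentioned here could look roughly like the sketch below. It is an assumption-heavy illustration: the walkthrough does not show the API's endpoints, payloads, or client library, so the actual request is left to a caller-supplied `submit` function, and `load_protocols`, `run_batch`, `fake_submit`, the `.txt` file layout, and the `example.invalid` URL are all hypothetical names invented for this example. The real request format should be taken from the API docs.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_protocols(folder: Path) -> dict[str, str]:
    """Read every .txt file in the folder as one protocol (hypothetical layout)."""
    return {
        path.stem: path.read_text(encoding="utf-8").strip()
        for path in sorted(folder.glob("*.txt"))
    }


def run_batch(protocols: dict[str, str],
              submit: Callable[[str], str]) -> dict[str, dict]:
    """Submit each protocol and record the review URL or the error.

    One failed submission does not stop the rest of the batch, so no
    research question is silently dropped.
    """
    results = {}
    for name, text in protocols.items():
        try:
            results[name] = {"ok": True, "url": submit(text)}
        except Exception as exc:  # keep going; report the failure per protocol
            results[name] = {"ok": False, "error": str(exc)}
    return results


def fake_submit(protocol_text: str) -> str:
    """Stand-in for a real API call (hypothetical); returns a placeholder URL."""
    if not protocol_text:
        raise ValueError("empty protocol")
    return f"https://example.invalid/reviews/{len(protocol_text)}"


def demo() -> dict[str, dict]:
    """Run the batch against a temporary folder with one good and one empty file."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "glp1_adherence.txt").write_text(
            "Among adults with type 2 diabetes initiating a GLP-1 receptor agonist, "
            "what are persistence and adherence over 12 to 24 months?",
            encoding="utf-8",
        )
        (folder / "empty.txt").write_text("", encoding="utf-8")
        return run_batch(load_protocols(folder), fake_submit)


if __name__ == "__main__":
    for name, result in demo().items():
        print(name, result)
```

Passing the submit step in as a function keeps the batching and error reporting testable without network access; swapping `fake_submit` for a real client call would be the only change needed once the actual request format is known.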
And so the whole purpose of this is, when you have dozens and dozens of research questions, can you just do them in one go? Yes, you can. Give it one more minute. While it goes, any questions on the API or on anything we've discussed today? Cool. While it goes, I'll do the final wrap-up, which is that we've met our goal: to start with a protocol and end with a report that covers the search, title/abstract screening, full-text screening, and extraction phases of the systematic review. We can of course scale it up much more. I only looked at 984 papers. We can do up to 40,000 per Elicit systematic review, so you can really make sure that you get every single paper that matters. We'll share the recording of this webinar so you can go back and check it. We will share the finished systematic review as well, and I will also see if I can include the final systematic reviews that ran through the API, if you'd like to check them out. But thank you so much for attending. This is starting to run the systematic reviews. You can extrapolate and see that it would run the systematic review from here. It's got a couple of errors. We'll fix those and then go run the systematic review, and then you can check the actual results through the systematic review API, download them, pull the extraction tables, and do analysis, all from the API. And the final thing is that we've seriously upgraded our systematic review. If you want to chat more, feel free to reach out to me at hamsaissa.com. I can answer any questions, and if you want to upgrade to Scale or Enterprise, which are the plans where this upgraded version of systematic review lives, I can also help you figure out the right way to do that. I will send the results in the final doc. Thank you so much for joining. Cool.
Yeah, for the people who are still here: it ran this. It's running the systematic reviews. It's going to give me the URL, and I can show you what it looks like to actually see it running. Let's see. Stop sharing that. Share this. I believe this is it. Yep. So this is the systematic review we started from the API, and it's already into the screening phase. This one is just doing title/abstract screening, but we're already about 16% of the way through screening. I can modify things here as well and take control, but otherwise it'll just continue all the way to generating a report. And with that, thank you so much for joining. I will talk to you later. Bye.
A walkthrough of completing an end-to-end systematic review in Elicit in less than an hour. Recorded as a webinar on May 13, 2026.